ONR/FINAL REPORT




VALIDITY  STUDY  IN 
MULTIDIMENSIONAL  LATENT  SPACE 
AND  EFFICIENT  COMPUTERIZED 
ADAPTIVE  TESTING 




FUMIKO  SAMEJIMA 


UNIVERSITY  OF  TENNESSEE 


KNOXVILLE,  TENN.  37996-0900 


SEPTEMBER,  1990 




Prepared  under  the  contract  number  N00014-87-K-0320, 
4421-549  with  the 

Cognitive  Science  Research  Program 
Cognitive  and  Neural  Sciences  Division 
Office  of  Naval  Research 


Approved  for  public  release;  distribution  unlimited. 
Reproduction  in  whole  or  in  part  is  permitted  for 
any  purpose  of  the  United  States  Government. 


R01-1069-1 1-004-91 


REPORT DOCUMENTATION PAGE
(Form Approved, OMB No. 0704-0188)

3.  DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release; distribution unlimited

6a.  NAME OF PERFORMING ORGANIZATION: Fumiko Samejima, Ph.D., Psychology Department

6c.  ADDRESS (City, State, and ZIP Code): 310B Austin Peay Building, The University of Tennessee, Knoxville, TN 37996-0900

7a.  NAME OF MONITORING ORGANIZATION: Cognitive Science, 1142 CS

7b.  ADDRESS (City, State, and ZIP Code): Office of Naval Research, 800 N. Quincy Street, Arlington, VA 22217

8a.  NAME OF FUNDING/SPONSORING ORGANIZATION: Cognitive Science Research Program

8c.  ADDRESS (City, State, and ZIP Code): Office of Naval Research, 800 N. Quincy Street, Arlington, VA 22217

9.  PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: N00014-87-K-0320

10.  SOURCE OF FUNDING NUMBERS: Program Element No. 61153N; Project No. RR-042-04; Task No. 042-04-01

11.  TITLE (Include Security Classification): Validity study in multidimensional latent space and efficient computerized adaptive testing

12.  PERSONAL AUTHOR(S): Fumiko Samejima, Ph.D.

13a.  TYPE OF REPORT: Final report

13b.  TIME COVERED: to 1990

14.  DATE OF REPORT (Year, Month, Day): September 24, 1990

15.  PAGE COUNT: 88

18.  SUBJECT TERMS: Latent Trait Models, Mental Test Theory, Multiple-Choice Test, Computerized Adaptive Testing, Test Reliability, Test Validity, Test Information Function, Nonparametric Estimation

19.  ABSTRACT: This is a summary of the research conducted in the past three years and seven months, 1987-90, under the title, "Validity Study in Multidimensional Latent Space and Efficient Computerized Adaptive Testing."

20.  DISTRIBUTION/AVAILABILITY OF ABSTRACT: Unclassified/Unlimited

22a.  NAME OF RESPONSIBLE INDIVIDUAL: Dr. Charles E. Davis

22c.  OFFICE SYMBOL: ONR-1142-CS

DD Form 1473, JUN 86. Previous editions are obsolete. S/N 0102-LF-014-6603


PREFACE 


Three  and  a  half  years  have  passed  since  I  started  this  research  on  March  1,  1987.  During  this  period, 
so  many  things  were  designed  and  accomplished,  and  as  the  principal  investigator  I  find  it  extremely 
difficult  to  include  and  systematize  all  the  important  findings  and  implications  within  a  single  final 
report.  It  is  my  regret  that  many  of  them  have  to  be  left  out,  but  I  did  my  best  within  a  limited 
amount  of  time  with  the  hope  that  this  final  report  will  help  the  reader  to  grasp  the  outline  of  the 
whole  accomplishment. 

There  were  five  main  objectives  in  the  original  research  proposal,  and  they  can  be  summarized  as 
follows. 

[1]  Further investigate the nonparametric approach to the estimation of the operating
characteristics of discrete item responses.

[2]  Revise  and  strengthen  the  package  computer  programs  and  eventually  implement  them  in 
the  Unix  Operating  System. 

[3]  Investigate  an  ideal  computerized  adaptive  testing  procedure  and  eventually  materialize  it 
in  the  SUN  microcomputer  system  networked  with  IBM  personal  computers. 

[4]  Investigate  multidimensional  latent  trait  theory. 

[5]  Pursue  item  validity  and  test  validity  using  the  multidimensional  latent  space. 

Of these, Objectives [1] and [5], together with Objectives [2] and [3], were pursued most intensively.
This part of the research was the most productive, and it provides us with valuable perspectives
for future research.

During  the  research  period  there  were  many  people  who  helped  me  as  assistants,  secretaries,  etc., 
as  I  acknowledged  in  each  research  report.  Also  people  of  the  Office  of  Naval  Research,  especially  Dr. 
Charles  E.  Davis,  and  those  of  the  ONR  Atlanta  Office,  including  Mr.  Thomas  Bryant,  have  been  of 
great  help  in  conducting  the  research.  I  would  like  to  express  my  gratitude  to  all  of  them. 

Thanks  are  also  due  to  my  assistants,  Nancy  H.  Domm  and  Raed  A.  Hijer,  who  helped  me  in 
preparing  this  final  report.  Appreciation  is  also  extended  to  my  former  assistants,  Christine  A.  Golik 
and  Philip  S.  Livingston,  who  still  helped  me  occasionally  during  the  research  period. 
I  Introduction


This is the final report of the multi-year research project entitled Validity Study in Multidimensional
Latent Space and Efficient Computerized Adaptive Testing, which was sponsored by the Office of Naval
Research  in  1987  through  1990  (N00014-87-K-0320).  The  accomplishments  include  those  which  have 
already  been  published  as  ONR  research  reports  as  well  as  those  still  in  progress,  which  will  be  published 
in  later  years  as  part  of  more  comprehensive  research  results. 

The  rest  of  this  chapter  will  describe  papers  published  or  presented  during  the  research  period,  and 
related  events.  The  contents  of  the  research  accomplishments  will  be  summarized  and  systematized, 
and  will  be  described  in  the  succeeding  chapters. 

[1.1]  Research  Reports 

The  following  are  the  ONR  research  reports  that  have  been  published  in  the  present  research  project. 

(1)  Modifications of the Test Information Function. Office of Naval Research Report 90-1, 1990.

(2)  Predictions  of  Reliability  Coefficients  and  Standard  Errors  of  Measurement  Using  the  Test 
Information  Function  and  its  Modifications.  Office  of  Naval  Research  Report  90-2,  1990. 

(3)  Validity Measures in the Context of Latent Trait Models. Office of Naval Research Report
90-3, 1990.

(4)  Differential  Weight  Procedure  of  the  Conditional  P.D.F.  Approach  for  Estimating  the 
Operating  Characteristics  of  Discrete  Item  Responses.  Office  of  Naval  Research  Report 
90-4,  1990. 

(5)  Content-Based Observation of Informative Distractors and Efficiency of Ability Estimation.
Office of Naval Research Report 90-5, 1990.

[1.2]  Special  Contribution  Paper 

During this period, at the request of Dr. Chikio Hayashi, president of the Behaviormetric Society,
a special contribution paper entitled Comprehensive Latent Trait Theory was written and published in
Behaviormetrika, Vol. 24, 1988. The paper is based upon the invited address, a one-hour special lecture
overviewing latent trait models, which was given at the 1987 Annual Meeting of the Behaviormetric
Society at Kyushu University, Fukuoka, Japan, under the title, Overview of Latent Trait Models.
There were more than two hundred researchers in the audience. The summary of the paper is given
as Appendix B of the author’s ONR Final Report: Advancement of Latent Trait Theory, which was
published in 1988.

[1.3]  Paper  Presentations  at  Conferences 

Thirteen papers were presented at conferences during this research period, excluding those in
1987, which have been reported in “Final Report: Advancement of Latent Trait Theory.” They include
presentations at ONR contractors’ meetings, and are listed below.

(1)  A  Robust  Method  of  On-Line  Calibration.  American  Educational  Research  Association 
Meeting,  New  Orleans,  1988.  U.  S.  A. 

(2)  Some  Modifications  of  the  On-Line  Item  Calibration  Methods.  ONR  Conference  on  Model- 
Based  Measurement,  Iowa  City,  1988.  U.  S.  A. 
(3)  Information Functions of the General Model Developed for Differential Strategies and
Possibilities for Applying Half-Discrete, Half-Continuous Models for Projective Techniques.
ONR Conference on Model-Based Measurement, Iowa City, 1988. U. S. A.

(4)  Some Refinement in the Estimation of the Operating Characteristics of Discrete Item
Responses without Assuming any Mathematical Form. Psychometric Society Meeting, Los
Angeles, 1988. U. S. A.

(5)  Prospect of Analyzing Rorschach Data by Sophisticated Psychometric Methods. Symposium:
The Burstein-Loucks Rorschach Scoring System: Clinical and Psychometric Developments.
American Psychological Association Annual Meeting, Atlanta, 1988. U. S. A.

(6)  Latent Trait Approach to Rorschach Diagnosis Based upon the Burstein-Loucks Scoring
System. American Educational Research Association Annual Meeting, San Francisco, 1989.
U. S. A. (round-table session)

(7)  Some Considerations on Validity Measures in Latent Trait Theory. ONR Conference on
Model-Based Measurement, Norman, OK, 1989. U. S. A.

(8)  Differential Weight Procedure of the Conditional P.D.F. Approach in the Estimation of
Operating Characteristics of Discrete Item Responses. ONR Conference on Model-Based
Measurement, Norman, OK, 1989. U. S. A.

(9)  Some Reliability and Validity Measures in the Context of Latent Trait Models. Psychometric
Society Annual Meeting, Los Angeles, 1989. U. S. A.

(10)  Prospect of Applying Latent Trait Models and Methodologies Accommodating Both
Psychological and Neurological Factors. American Educational Research Association Annual
Meeting, Boston, 1990. U. S. A.

(11)  Reliability/Validity Indices in the Context of Latent Trait Models. American Educational
Research Association Annual Meeting, Boston, 1990. U. S. A.

(12)  Further Considerations for the Differential Weight Procedure of Estimating the Operating
Characteristics of Discrete Item Responses. ONR Conference on Model-Based Measurement,
Portland, OR, 1990. U. S. A.

(13)  Modified Test Information Functions, Their Usefulnesses and Prediction of the Test
Reliability Coefficient Tailored for a Specific Ability Distribution. ONR Conference on
Model-Based Measurement, Portland, OR, 1990. U. S. A.

[1.4]  Other  Events 

The  principal  investigator  gave  a  seminar  entitled  Comprehensive  Latent  Trait  Models  in  September, 
1989,  at  the  National  Center  for  University  Entrance  Examination,  Tokyo,  Japan,  invited  by  Dr.  Shuichi 
Iwatsubo  of  the  Center  and  Dr.  Kazuo  Shigematsu  of  the  Tokyo  Engineering  University. 

She also collaborated in research with Professor Sukeyori Shiba of the University of Tokyo, and
with Dr. Takahiro Sato of the C&C Information Technology Research Laboratories of Nippon Electric
Company, Japan.
II  Backgrounds and Basic Concepts Used throughout the Research

In this chapter, the backgrounds and the basic concepts upon which the present research has been
conducted are introduced. The reader who wants to know these concepts and developments in more
detail is directed to the author’s two previous ONR final reports (Samejima, 1981b, 1988) and other
ONR research reports.

[II.1]  General Concepts in Latent Trait Models

Let $\theta$ be ability, or latent trait, which assumes any real number. Let $g$ $(= 1, 2, \ldots, n)$ denote an
item, $k_g$ be any discrete item response to item $g$, and $P_{k_g}(\theta)$ denote the operating characteristic of
$k_g$, or the conditional probability assigned to $k_g$, given $\theta$, i.e.,

(2.1)  $P_{k_g}(\theta) = \mathrm{prob.}[k_g \mid \theta]$ .

We assume that $P_{k_g}(\theta)$ is three-times differentiable with respect to $\theta$. We have for the item response
information function (Samejima, 1972)

(2.2)  $I_{k_g}(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_{k_g}(\theta) = \left[ \frac{\partial}{\partial\theta} P_{k_g}(\theta) \, \{P_{k_g}(\theta)\}^{-1} \right]^2 - \frac{\partial^2}{\partial\theta^2} P_{k_g}(\theta) \, [P_{k_g}(\theta)]^{-1}$ ,

and the item information function is defined as the conditional expectation of $I_{k_g}(\theta)$, given $\theta$, such
that

(2.3)  $I_g(\theta) = E[I_{k_g}(\theta) \mid \theta] = \sum_{k_g} I_{k_g}(\theta) P_{k_g}(\theta) = \sum_{k_g} \left[ \frac{\partial}{\partial\theta} P_{k_g}(\theta) \right]^2 [P_{k_g}(\theta)]^{-1}$ .

In the special case where the item $g$ is scored dichotomously, this item information function is simplified
to become

(2.4)  $I_g(\theta) = \left[ \frac{\partial}{\partial\theta} P_g(\theta) \right]^2 \left[ \{P_g(\theta)\} \{1 - P_g(\theta)\} \right]^{-1}$ ,

where $P_g(\theta)$ denotes the operating characteristic of the correct answer to item $g$.

Let $V$ be a response pattern such that

(2.5)  $V = \{k_g\}'$ ,  $g = 1, 2, \ldots, n$ .

The operating characteristic, $P_V(\theta)$, of the response pattern $V$ is defined as the conditional probability
of $V$, given $\theta$. Throughout this report the principle of local independence is assumed to be valid,
so that within any group of examinees all characterized by the same value of the latent variable $\theta$
the distributions of the item response categories are all independent of each other. Thus the operating
characteristic of a given response pattern is a product of the operating characteristics of the item
response categories contained in that response pattern, so that we can write

(2.6)  $P_V(\theta) = \prod_{k_g \in V} P_{k_g}(\theta)$ .
The response pattern information function, $I_V(\theta)$, (Samejima, 1972) is given by

(2.7)  $I_V(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_V(\theta) = \sum_{k_g \in V} I_{k_g}(\theta)$ ,

and the test information function, $I(\theta)$, is defined as the conditional expectation of $I_V(\theta)$, given $\theta$,
and we obtain from (2.2), (2.3), (2.5), (2.6) and (2.7)

(2.8)  $I(\theta) = E[I_V(\theta) \mid \theta] = \sum_V I_V(\theta) P_V(\theta) = \sum_{g=1}^{n} I_g(\theta)$ .
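
As a rough numerical illustration of (2.4) and (2.8), the following Python sketch evaluates the item and test information functions for dichotomous items. The two-parameter logistic form of $P_g(\theta)$ and the item parameter values are assumptions introduced only for this example; the text above does not commit to any particular model.

```python
import math


def logistic_p(theta, a, b):
    """Assumed operating characteristic P_g(theta): two-parameter logistic."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(p_func, theta, h=1e-5):
    """Equation (2.4): [dP/dtheta]^2 / [P (1 - P)].

    The derivative is approximated by a central difference.
    """
    p = p_func(theta)
    dp = (p_func(theta + h) - p_func(theta - h)) / (2.0 * h)
    return dp * dp / (p * (1.0 - p))


def total_information(p_funcs, theta):
    """Equation (2.8): test information as the sum of item informations."""
    return sum(item_information(p, theta) for p in p_funcs)


# Hypothetical (a_g, b_g) pairs, not taken from the report.
ITEM_PARAMS = [(1.0, -1.0), (1.5, 0.0), (2.0, 1.0)]
P_FUNCS = [lambda t, a=a, b=b: logistic_p(t, a, b) for a, b in ITEM_PARAMS]

if __name__ == "__main__":
    for theta in (-2.0, 0.0, 2.0):
        info = total_information(P_FUNCS, theta)
        print(f"theta={theta:+.1f}  I(theta)={info:.4f}")
```

Under the assumed logistic form the item information reduces to $a_g^2 P_g(\theta)\{1 - P_g(\theta)\}$, which gives a convenient check on the numerical derivative.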

[II.2]  Critical Observations of the Reliability, Standard Error of Measurement
and Validity of a Test

The reliability coefficient and the standard error of measurement in classical mental test theory are two
concepts that have been widely accepted and used by psychologists and test users in the past decades.
The author has pointed out repeatedly, however, that these measures are actually attributes of a
specified group of examinees as well as of a given test. In addition, even if we take this fact into account,
representing these measures by single numbers results in over-simplification and a lack of useful
information for both theorists and actual users of tests. In contrast to this, in latent trait models,
the item and test information functions, which are defined by (2.3) and (2.8), respectively, provide us
with abundant information about the local accuracy of estimation, a concept which is totally missing
in classical mental test theory. These functions are population-free, i.e., unlike the reliability coefficient
and the standard error of measurement, they do not depend upon any specific group of examinees.

Unlike test reliability, which has progressively been dissolved in this way, test validity is one concept
that has been rather neglected in the context of latent trait models. Several types of validity have been
identified and discussed in classical mental test theory, which include content validity, construct
validity, and criterion-oriented validity. Perhaps we can say that, in modern mental test theory, both
content validity and construct validity are well accommodated, although they are not explicitly stated.
If each item is based upon cognitive processes that are directly related to the ability to be measured,
then the content of the operationally defined latent variable behind the examinees’ performances will
be validated. Construct validity can also be identified, with all the mathematically sophisticated
structures and functions which characterize latent trait models and which classical mental test theory
does not provide. With respect to criterion-oriented validity, however, latent trait models have so far
not offered as much as they have for test reliability and the standard error of measurement.

In  classical  mental  test  theory,  the  validity  coefficient  is  again  a  single  number,  i.e.,  the  product- 
moment  correlation  coefficient  between  the  test  score  and  the  criterion  variable.  Since  the  correlation 
coefficient  is  largely  affected  by  the  heterogeneity  of  the  group  of  examinees,  i.e.,  for  a  fixed  test  the 
coefficient  tends  to  be  higher  when  individual  differences  among  the  examinees  in  the  group  are  greater, 
and  vice  versa  (cf.  Samejima,  1977b),  we  must  keep  in  mind  that  so-called  test  validity  represents  the 
degree  of  heterogeneity  in  ability  among  the  examinees  tested,  as  well  as  the  quality  of  the  test  itself. 
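
The dependence of the validity coefficient on group heterogeneity can be illustrated with a deliberately simple model, which is an assumption of this sketch and not part of the report: the test score and the criterion are each taken to be ability plus an independent error, so that their correlation is $\sigma_\theta^2 / \sqrt{(\sigma_\theta^2 + \sigma_e^2)(\sigma_\theta^2 + \sigma_c^2)}$. All names and variance values below are hypothetical.

```python
import math


def validity_coefficient(var_ability, var_test_error, var_criterion_error):
    """Correlation between X = theta + e and Y = theta + c with independent errors."""
    cov = var_ability  # only theta is shared by X and Y
    sd_x = math.sqrt(var_ability + var_test_error)
    sd_y = math.sqrt(var_ability + var_criterion_error)
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Same test and criterion (error variances fixed), increasingly heterogeneous groups.
    for var_ability in (0.25, 1.0, 4.0):
        r = validity_coefficient(var_ability, 1.0, 1.0)
        print(f"ability variance {var_ability:4.2f} -> validity coefficient {r:.3f}")
```

With the test and the criterion held fixed, the coefficient in this toy model rises only because the ability variance of the group rises.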

[II.3]  Nonparametric Approach to the Estimation of the Operating
Characteristics of Discrete Item Responses

As early as 1977 the author proposed the Normal Approximation Method (Samejima, 1977b), which
can be used for item calibration both in computerized adaptive testing and in paper-and-pencil testing.
She also discussed the effective use of information functions in adaptive testing (Samejima, 1977a).
Since then, with the support of the Office of Naval Research, she has developed several approaches and
methods for the same purpose (cf. Samejima, 1977c, 1978a, 1978b, 1978c, 1978d, 1978e, 1978f, 1980a,
1980b, 1981a, 1981b, 1988; Samejima and Changas, 1981). For convenience, they can be categorized
as follows.

Approaches

(1)  Bivariate P.D.F. Approach
(2)  Histogram Ratio Approach
(3)  Curve Fitting Approach
(4)  Conditional P.D.F. Approach
     (4.1)  Simple Sum Procedure
     (4.2)  Weighted Sum Procedure
     (4.3)  Proportioned Sum Procedure

Methods

(1)  Pearson System Method
(2)  Two-Parameter Beta Method
(3)  Normal Approach Method
(4)  Lognormal Approach Method

Here  by  an  approach  we  mean  a  general  procedure  in  approaching  the  operating  characteristics  of  a 
discrete  item  response,  and  by  a  method  we  mean  a  specific  method  in  approximating  the  conditional 
density  of  ability,  given  its  maximum  likelihood  estimate.  Thus  a  combination  of  an  approach  and  a 
method  provides  us  with  a  specific  procedure  for  estimating  the  operating  characteristic  of  a  discrete 
item  response. 

These  approaches  and  methods  are  characterized  by  two  features,  i.e., 

(1)  estimation  is  made  without  assuming  any  mathematical  forms  for  the  operating 
characteristics  of  discrete  item  responses,  and 


(2)  estimation  is  efficient  enough  to  base  itself  upon  a  relatively  small  set  of  data  of,  say, 
several  hundred  to  a  few  thousand  examinees. 

The backgrounds common to the Bivariate P.D.F. and Conditional P.D.F. Approaches and the
differences among different methods can be described as follows. For the sake of simplicity in handling
mathematics, the tentative transformation of $\theta$ to $\tau$ is made by

(2.9)  $\tau = C_1^{-1} \int_{-\infty}^{\theta} [I(t)]^{1/2} \, dt + C_0$ ,

where $C_0$ is an arbitrary constant for adjusting the origin of $\tau$, and $C_1$ is an arbitrary constant
which equals the square root of the test information function, $I^*(\tau)$, of $\tau$, so that we can write

(2.10)  $C_1 = [I^*(\tau)]^{1/2}$

for all $\tau$. This transformation will be simplified if we use a polynomial approximation to the square
root of the test information function, $[I(\theta)]^{1/2}$, in the least squares sense, which is accomplished by
using the method of moments (cf. Samejima and Livingston, 1979) for the meaningful interval of $\tau$.
Thus (2.9) can be changed to the form

(2.11)  $\tau = C_1^{-1} \sum_{k=0}^{m} a_k (k+1)^{-1} \theta^{k+1} + C_0 = \sum_{k=0}^{m+1} a_k^* \theta^k$ ,

where $a_k$ $(k = 0, 1, \ldots, m)$ is the $k$-th coefficient of the polynomial of degree $m$ approximating the
square root of $I(\theta)$, and $a_k^*$ is the new $k$-th coefficient which is given by

(2.12)  $a_k^* = \begin{cases} C_0 & k = 0 \\ (C_1 k)^{-1} a_{k-1} & k = 1, 2, \ldots, m+1 \end{cases}$ .
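
A minimal sketch of (2.11) and (2.12) follows. The polynomial coefficients $a_k$ and the constants $C_0$ and $C_1$ are hypothetical values chosen for illustration; in practice the $a_k$ would come from fitting $[I(\theta)]^{1/2}$ as described above. The demonstration relies on the fact that $d\tau/d\theta = C_1^{-1} [I(\theta)]^{1/2}$, which is what makes the information on the $\tau$ scale constant.

```python
def polyval(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def transformed_coefficients(a, c0, c1):
    """Equation (2.12): a*_0 = C_0 and a*_k = a_{k-1} / (C_1 k) for k >= 1."""
    return [c0] + [a[k - 1] / (c1 * k) for k in range(1, len(a) + 1)]


def tau_of_theta(theta, a, c0, c1):
    """Equation (2.11): tau as a polynomial of degree m + 1 in theta."""
    return polyval(transformed_coefficients(a, c0, c1), theta)


# Hypothetical fit: sqrt(I(theta)) ~ 2.0 - 0.1 * theta**2 on the interval of interest.
A = [2.0, 0.0, -0.1]
C0, C1 = 0.5, 2.0

if __name__ == "__main__":
    h = 1e-5
    for theta in (-1.0, 0.0, 1.0):
        slope = (tau_of_theta(theta + h, A, C0, C1)
                 - tau_of_theta(theta - h, A, C0, C1)) / (2.0 * h)
        # C1 * dtau/dtheta should reproduce the fitted sqrt(I(theta)).
        print(theta, tau_of_theta(theta, A, C0, C1), C1 * slope, polyval(A, theta))
```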

With this transformation of $\theta$ to $\tau$ and by virtue of (2.10), we can use the asymptotic normality
with the two parameters, $\tau$ and $C_1^{-1}$, as the approximation to the conditional distribution of the
maximum likelihood estimator $\hat\tau$, given its true value $\tau$ (cf. Samejima, 1981b). Then the first through
fourth conditional moments of $\tau$, given $\hat\tau$, can be obtained from the density function, $g^*(\hat\tau)$, of $\hat\tau$
and from the constant $C_1$ by the following four formulae (cf. Samejima, 1981b):

(2.13)  $E(\tau \mid \hat\tau) = \hat\tau + C_1^{-2} \frac{d}{d\hat\tau} \log g^*(\hat\tau)$ ,

(2.14)  $\mathrm{Var.}(\tau \mid \hat\tau) = C_1^{-2} \left[ 1 + C_1^{-2} \frac{d^2}{d\hat\tau^2} \log g^*(\hat\tau) \right]$ ,

(2.15)  $E[\{\tau - E(\tau \mid \hat\tau)\}^3 \mid \hat\tau] = C_1^{-6} \left[ \frac{d^3}{d\hat\tau^3} \log g^*(\hat\tau) \right]$

and

(2.16)  $E[\{\tau - E(\tau \mid \hat\tau)\}^4 \mid \hat\tau] = C_1^{-4} \left[ 3 + 6 C_1^{-2} \left\{ \frac{d^2}{d\hat\tau^2} \log g^*(\hat\tau) \right\} + 3 C_1^{-4} \left\{ \frac{d^2}{d\hat\tau^2} \log g^*(\hat\tau) \right\}^2 + C_1^{-4} \left\{ \frac{d^4}{d\hat\tau^4} \log g^*(\hat\tau) \right\} \right]$ .

This density function, $g^*(\hat\tau)$, can be estimated by fitting a polynomial, using the method of moments
(cf. Samejima and Livingston, 1979), as we did in the transformation of $\theta$ to $\tau$, based upon the
empirical set of $\hat\tau$'s. Note that in the above formulae the first moment is about the origin, while the
other three are about the mean.
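
Formulae (2.13) through (2.16) translate directly into code once the first four derivatives of $\log g^*(\hat\tau)$ are available. In the sketch below those derivatives are passed in as plain numbers, and how they are obtained (for example from a fitted polynomial) is left out. The example values assume a normal $g^*$, chosen only because the results can then be checked against the familiar conditional mean and variance of the normal case. Function and argument names are illustrative.

```python
def conditional_moments(tau_hat, c1, d1, d2, d3, d4):
    """Moments of tau given tau_hat from (2.13)-(2.16).

    d1..d4 are the first to fourth derivatives of log g*(tau_hat).
    Returns (mean, variance, third central moment, fourth central moment).
    """
    s2 = c1 ** -2  # C_1^{-2}, the variance of tau_hat given tau
    mean = tau_hat + s2 * d1                                   # (2.13)
    variance = s2 * (1.0 + s2 * d2)                            # (2.14)
    third = s2 ** 3 * d3                                       # (2.15)
    fourth = s2 ** 2 * (3.0 + 6.0 * s2 * d2
                        + 3.0 * (s2 * d2) ** 2
                        + s2 ** 2 * d4)                        # (2.16)
    return mean, variance, third, fourth


def normal_log_density_derivatives(x, mu, var):
    """First to fourth derivatives of the log of a normal density (assumed g*)."""
    return -(x - mu) / var, -1.0 / var, 0.0, 0.0


if __name__ == "__main__":
    c1 = 2.0                           # hypothetical, so Var(tau_hat | tau) = 0.25
    ability_mean, ability_var = 0.0, 1.0
    marginal_var = ability_var + c1 ** -2
    d = normal_log_density_derivatives(1.0, ability_mean, marginal_var)
    print(conditional_moments(1.0, c1, *d))
```

In this normal case the third moment vanishes and the fourth equals three times the squared variance, as expected for a normal conditional distribution.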

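The two coefficients, $\beta_1$ and $\beta_2$, and Pearson’s criterion $\kappa$ are obtained by

(2.17)  $\beta_1 = \mu_3^2 \, \mu_2^{-3}$ ,

(2.18)  $\beta_2 = \mu_4 \, \mu_2^{-2}$

and

(2.19)  $\kappa = \beta_1 (\beta_2 + 3)^2 \left[ 4 (2\beta_2 - 3\beta_1 - 6)(4\beta_2 - 3\beta_1) \right]^{-1}$ ,

by substituting $\mu_2$, $\mu_3$ and $\mu_4$ by $\mathrm{Var.}(\tau \mid \hat\tau)$, $E[\{\tau - E(\tau \mid \hat\tau)\}^3 \mid \hat\tau]$ and $E[\{\tau - E(\tau \mid \hat\tau)\}^4 \mid \hat\tau]$,
respectively, which are obtained by formulae (2.14), (2.15) and (2.16).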
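
The computation in (2.17) through (2.19) can be sketched as follows. The function name and the handling of a vanishing denominator (returning NaN) are choices made for this illustration, not something the report specifies.

```python
import math


def pearson_criterion(mu2, mu3, mu4):
    """Return (beta1, beta2, kappa) from (2.17)-(2.19).

    mu2, mu3, mu4 are the second to fourth central moments.
    kappa is returned as NaN when the denominator of (2.19) vanishes,
    e.g. for normal moments (beta1 = 0, beta2 = 3).
    """
    beta1 = mu3 ** 2 / mu2 ** 3                                   # (2.17)
    beta2 = mu4 / mu2 ** 2                                        # (2.18)
    denom = 4.0 * (2.0 * beta2 - 3.0 * beta1 - 6.0) * (4.0 * beta2 - 3.0 * beta1)
    kappa = beta1 * (beta2 + 3.0) ** 2 / denom if denom != 0.0 else math.nan  # (2.19)
    return beta1, beta2, kappa


if __name__ == "__main__":
    # Hypothetical central moments of a skewed conditional distribution.
    print(pearson_criterion(1.0, 1.0, 5.0))
```

Because the estimated moments are floating-point quantities, an exactly zero denominator is rare in practice; a tolerance-based check may be preferable in real use.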

In the Bivariate P.D.F. Approach, we approximate the bivariate distribution of the transformed
latent trait $\tau$ and its maximum likelihood estimate $\hat\tau$ for each subpopulation of examinees who share
the same discrete item response to a specified item. Thus the procedure must be repeated as many times
as the number of discrete item response categories for each separate item. It is a rather time-consuming
approach, and the CPU time for the item calibration increases almost proportionally to the number of
new items.

In contrast to this, the Conditional P.D.F. Approach deals with the total population of subjects, and
all the items together. Effort is focused upon the approximation of the conditional distribution of $\tau$,
given $\hat\tau$, for the total population of examinees, and then the result is branched into separate discrete
item response subpopulations for each item.

If we compare the two approaches with each other, therefore, we can say that the Bivariate P.D.F.
Approach is an orthodox approach, while the Conditional P.D.F. Approach needs an assumption that the
conditional distribution of $\tau$, given $\hat\tau$, is unaffected by the different subpopulations of examinees.
While this assumption can only be regarded as tolerable in most cases, the latter approach has two big
advantages: the CPU time required in item calibration is substantially less, and it does not have
to deal with subgroups of small numbers of subjects in approximating the joint bivariate distributions
of $\tau$ and $\hat\tau$.

In each of these two approaches, we can choose one of the four methods listed earlier in estimating
the bivariate density of $\tau$ and $\hat\tau$, or the conditional density of $\tau$, given its maximum likelihood
estimate $\hat\tau$. In so doing, in the Pearson System Method, we use all four conditional moments of
$\tau$, given $\hat\tau$, which are estimated through the formulae (2.13) through (2.16), and, using Pearson’s
criterion $\kappa$, which is given by (2.19), one of the Pearson System density functions is selected. In the
Two-Parameter Beta Method two of the four parameters of the Beta density function, i.e., the lower
and upper endpoints of the interval of $\tau$ for which the Beta density is positive, are a priori given, and
the other two parameters are estimated by using the first two conditional moments of $\tau$, given $\hat\tau$,
which are provided by (2.13) and (2.14), respectively. In the Normal Approach Method, again we use
only the first two conditional moments of $\tau$, given $\hat\tau$, as the first and second parameters of the normal
density function.

If we compare these three methods, it will be appropriate to say that both the Two-Parameter Beta
Method and the Normal Approach Method are simpler versions of the Pearson System Method. And yet
these two simpler methods have the advantage of using only the first two estimated conditional moments
of $\tau$, given $\hat\tau$, whereas the Pearson System Method requires the additional third and fourth conditional
moments, whose estimates are less accurate than those of the first two conditional moments. If we compare
the Two-Parameter Beta Method with the Normal Approach Method, we will notice that the former
allows non-symmetric density functions, while the latter does not. This is an advantage of the
Two-Parameter Beta Method over the Normal Approach Method, and yet the former has the disadvantage
of requiring that two of the four parameters be set a priori.
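
The moment-matching step of the Two-Parameter Beta Method can be sketched as below, assuming the usual Beta density with shape parameters $p$ and $q$ stretched onto a given interval. The parameterization, the function name, and the example numbers are assumptions of this illustration; the report only states that the two endpoints are fixed in advance and the other two parameters are estimated from the first two conditional moments.

```python
def beta_shape_parameters(mean, variance, lower, upper):
    """Match a Beta density on (lower, upper) to a given mean and variance.

    Returns the two shape parameters (p, q).
    """
    width = upper - lower
    m = (mean - lower) / width      # mean rescaled to (0, 1)
    v = variance / width ** 2       # variance rescaled accordingly
    if not 0.0 < m < 1.0 or not 0.0 < v < m * (1.0 - m):
        raise ValueError("moments are not attainable by a Beta density on this interval")
    common = m * (1.0 - m) / v - 1.0   # equals p + q
    return m * common, (1.0 - m) * common


if __name__ == "__main__":
    # Hypothetical conditional mean and variance of tau, with endpoints set a priori.
    p, q = beta_shape_parameters(mean=-1.0, variance=0.72, lower=-3.0, upper=3.0)
    print(p, q)  # p < q, i.e. a non-symmetric density
```

The check on `v` reflects the fact that a Beta density on a fixed interval cannot reproduce an arbitrarily large variance, which is one practical consequence of fixing the endpoints beforehand.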

The Lognormal Approach Method was developed later. It uses up to the third conditional moment
and allows more flexibility in the shape of the conditional distribution of $\tau$, given $\hat\tau$, than the Normal
Approach Method. It was intended to realize a happy medium between the Pearson System Method and
the Normal Approach Method, ameliorating the disadvantages of these two methods while keeping their
separate advantages.

[II.4]  Possible Non-Monotonicities of the Operating Characteristics

As early as 1968 the author wrote about and discussed the conceivable non-monotonicity of the
operating characteristic of the correct answer of the multiple-choice test item, which is based strictly
upon theory (cf. Samejima, 1968). Since then, such a phenomenon has actually been observed with
empirical data. For example, Lord and Novick reported such a curve when they plotted the percent of
the correct answer against the test score for each item as an approximation to the item characteristic
function (cf. Lord and Novick, 1968, Chapter 16). Since, as their Theorem 16.4.1 states, the average,
over all items, of the sample item-test regressions falls along a straight line through the origin, with
forty-five degree slope, such a dip cannot be detected for an easy item even if it exists, as long as we use
the item-test regression as an approximation. It is quite possible, therefore, that more than one of
those items have such dips, which simply were not detected.
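
The constraint behind this argument can be checked numerically: for any set of dichotomous response patterns, the proportions correct among examinees with a given total score, averaged over items, equal that score divided by the number of items. The sketch below uses arbitrary simulated responses (the generating probabilities are hypothetical and play no role in the result); the function name is illustrative.

```python
import random


def average_item_test_regression(responses):
    """For each observed total score, average over items the proportion correct.

    responses: list of equal-length lists of 0/1 item scores.
    Returns {total score: mean over items of P(correct | total score)}.
    """
    n_items = len(responses[0])
    by_score = {}
    for pattern in responses:
        by_score.setdefault(sum(pattern), []).append(pattern)
    averages = {}
    for score, rows in by_score.items():
        proportions = [sum(row[g] for row in rows) / len(rows) for g in range(n_items)]
        averages[score] = sum(proportions) / n_items
    return averages


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary simulated data, not from any real test
    data = [[int(rng.random() < 0.3 + 0.1 * g) for g in range(5)] for _ in range(200)]
    for score, avg in sorted(average_item_test_regression(data).items()):
        print(score, round(avg, 6), score / 5)
```

Since the average is pinned to the diagonal, the individual item-test regressions cannot all deviate from it in the same direction at a given score, which limits what such plots can reveal about any single item.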

In the past years various sets of data based upon the Vocabulary Subtest of the Iowa Tests of Basic
Skills, upon Shiba’s Word/Phrase Comprehension Tests, ASVAB Tests of Word Knowledge and of Math
Knowledge, etc., have been analyzed by using, mainly, the Simple Sum Procedure of the Conditional
P.D.F. Approach combined with the Normal Approach Method (cf. Samejima, 1981b). These tests
consist of multiple-choice test items, with four or five alternative answers in each item. As the result,
we have discovered non-monotonic operating characteristics of the correct answer for some of the items,
as well as differential information coming from the estimated operating characteristics of the incorrect
alternative answers, which are called plausibility functions.

Such  discoveries  of  non-monotonic  operating  characteristics  can  best  be  accomplished  by  using  a 
nonparametric  approach  to  the  estimation  of  the  operating  characteristics.  After  the  operating  charac¬ 
teristics  have  been  discovered  by  using  the  nonparametric  approach,  however,  it  may  be  wise  to  search 
for  mathematical  models  that  fit  the  results,  and  to  estimate  item  parameters  accordingly,  so  that  we 
shall  be  able  to  take  advantage  of  the  mathematical  simplicity  coming  from  the  parameterization. 
III  Proposal of Two Modification Formulae of the Test Information Function

Although the reciprocal of the test information function $I(\theta)$ provides us with a minimum variance
bound for any unbiased estimator of $\theta$ (cf. Kendall and Stuart, 1961), the maximum likelihood
estimate, which is denoted by $\hat\theta_V$ , is only asymptotically unbiased. For a finite number of
items, therefore, we need to examine whether the bias of $\hat\theta_V$ of a given test is practically nil
over the meaningful range of $\theta$ before we consider this reciprocal as a minimum variance bound. It has
been shown (Samejima, 1977a, 1977b) that in many cases the conditional distribution of $\hat\theta_V$ ,
given $\theta$ , converges to $N(\theta, [I(\theta)]^{-1/2})$ relatively quickly. On the other hand, we
have also noticed that the speed of convergence is not the same even if the amount of test information is
kept equal. This has been demonstrated by using the Constant Information Model (Samejima, 1979a), which is
represented by

(3.1)  $P_g(\theta) = \sin^2[a_g(\theta - b_g) + (\pi/4)]$ ,

where, as before, $P_g(\theta)$ denotes the operating characteristic of the correct answer, and
$a_g$ ($> 0$) and $b_g$ are the item discrimination and difficulty parameters, respectively. This model
provides us with a constant amount of item information $I_g(\theta)$ , which equals $4a_g^2$ , for the
interval of $\theta$ ,

(3.2)  $-\pi[4a_g]^{-1} + b_g < \theta < \pi[4a_g]^{-1} + b_g$

(cf. Samejima, 1979b).
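
The constancy of the item information in (3.1) can be checked numerically. The following is a minimal sketch, assuming the usual dichotomous item information $[P_g'(\theta)]^2 / [P_g(\theta)\{1 - P_g(\theta)\}]$; the parameter values and the function names are hypothetical and chosen only for illustration.

```python
import math

def p_const_info(theta, a, b):
    """Operating characteristic (3.1): sin^2[a(theta - b) + pi/4]."""
    return math.sin(a * (theta - b) + math.pi / 4) ** 2

def item_information(theta, a, b, h=1e-5):
    """Dichotomous item information P'^2 / [P(1 - P)], with P' by central difference."""
    p = p_const_info(theta, a, b)
    dp = (p_const_info(theta + h, a, b) - p_const_info(theta - h, a, b)) / (2 * h)
    return dp * dp / (p * (1 - p))

a, b = 0.5, 0.0  # hypothetical item parameters
lo, hi = b - math.pi / (4 * a), b + math.pi / (4 * a)   # interval (3.2)
grid = [lo + (hi - lo) * k / 10 for k in range(1, 10)]  # interior points only
infos = [item_information(t, a, b) for t in grid]
print(min(infos), max(infos), 4 * a * a)
```

Within the interval (3.2) every value in `infos` should agree with $4a_g^2$ up to the error of the finite-difference derivative.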

Thus two modification formulae of the test information function $I(\theta)$ have been proposed in the
present research in order to provide better measures of local accuracies of the estimation of $\theta$ , when
the maximum likelihood estimation is used. They start from the search for a minimum variance bound,
and from a minimum bound of the mean squared error, of any estimator, biased or unbiased.

[III.1]  Minimum Variance Bound

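Let $\theta_V^*$ denote any estimator of $\theta$ . We can write in general

(3.3)  $E(\theta_V^* \mid \theta) = \theta + E[(\theta_V^* - \theta) \mid \theta]$ .

When the item responses are discrete, we have

(3.4)  $E(\theta_V^* \mid \theta) = \sum_V \theta_V^* L_V(\theta) = \sum_V \theta_V^* P_V(\theta)$ ,

where $L_V(\theta)$ denotes the likelihood function. Differentiating both sides of (3.4) with respect to
$\theta$ , we obtain

(3.5)  $\frac{\partial}{\partial\theta} E(\theta_V^* \mid \theta) = \sum_V \theta_V^* [\frac{\partial}{\partial\theta} P_V(\theta)]$
       $= \sum_V [\theta_V^* - E(\theta_V^* \mid \theta)] [\frac{\partial}{\partial\theta} P_V(\theta)]$ .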

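We can write

(3.6)  $\frac{\partial}{\partial\theta} P_V(\theta) = [\frac{\partial}{\partial\theta} \log P_V(\theta)] P_V(\theta)$ ,

and using this we can rewrite (3.5) into the form

(3.7)  $\frac{\partial}{\partial\theta} E(\theta_V^* \mid \theta) = \sum_V [\theta_V^* - E(\theta_V^* \mid \theta)] [\frac{\partial}{\partial\theta} \log P_V(\theta)] P_V(\theta)$ .

From this result, by the Cramér-Rao inequality, we obtain

(3.8)  $[\frac{\partial}{\partial\theta} E(\theta_V^* \mid \theta)]^2 \le \mathrm{Var.}(\theta_V^* \mid \theta) \, E[\{\frac{\partial}{\partial\theta} \log P_V(\theta)\}^2 \mid \theta]$ .

Since we can write

(3.9)  $E[\{\frac{\partial}{\partial\theta} \log L_V(\theta)\}^2 \mid \theta] = -E[\frac{\partial^2}{\partial\theta^2} \log L_V(\theta) \mid \theta]$ ,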

from this, (2.7), (2.8) and (3.3) we can rewrite and rearrange the inequality (3.8) into the form

(3.10)  $\mathrm{Var.}(\theta_V^* \mid \theta) \ge [\frac{\partial}{\partial\theta} E(\theta_V^* \mid \theta)]^2 [I(\theta)]^{-1} = [1 + \frac{\partial}{\partial\theta} E(\theta_V^* - \theta \mid \theta)]^2 [I(\theta)]^{-1}$ ,

whose rightmost side provides us with the minimum variance bound of the conditional distribution
of any estimator $\theta_V^*$ . When $\theta_V^*$ is biased, the size of the minimum variance bound is
determined by the second term of the first factor of the minimum bound, and the result can be greater or
less than the reciprocal of the test information function depending upon the sign of this partial derivative.

[III.2]  First Modified Test Information Function

Lord has proposed a bias function for the maximum likelihood estimate of $\theta$ in the three-parameter
logistic model whose operating characteristic of the correct answer, $P_g(\theta)$ , is given by

(3.11)  $P_g(\theta) = c_g + (1 - c_g)[1 + \exp\{-D a_g(\theta - b_g)\}]^{-1}$ ,

where $a_g$ , $b_g$ , and $c_g$ are the item discrimination, difficulty, and guessing parameters, and $D$ is a
scaling factor, which is set equal to 1.7 when the logistic model is used as a substitute for the normal
ogive model. Lord’s bias function $B(\hat\theta_V \mid \theta)$ can be written as

(3.12)  $B(\hat\theta_V \mid \theta) = D [I(\theta)]^{-2} \sum_{g=1}^{n} a_g I_g(\theta) [\psi_g(\theta) - \frac{1}{2}]$ ,

where

(3.13)  $\psi_g(\theta) = [1 + \exp\{-D a_g(\theta - b_g)\}]^{-1}$

(cf. Lord, 1983). We can see in the above formula of the MLE bias function that the bias should be
negative when $\psi_g(\theta)$ is less than 0.5 for all the items, which is necessarily the case for lower
values of $\theta$ , and should be positive when $\psi_g(\theta)$ is greater than 0.5 for all the items,
i.e., for higher values of $\theta$ . In between, the bias tends to be close to zero, for the last factor in
the formula assumes negative values for some items and positive values for some others, provided that the
difficulty parameter $b_g$ distributes widely.
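
The sign pattern described above can be illustrated with a small computation. This is a sketch under stated assumptions: the nine items, their parameter values, and the function names are hypothetical, and the item information is taken to be $[P_g'(\theta)]^2 / [P_g(\theta)\{1 - P_g(\theta)\}]$ evaluated for the three-parameter logistic model.

```python
import math

D = 1.7

def psi(theta, a, b):
    """Logistic function (3.13)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    """Item information P'^2 / [P(1 - P)] for the three-parameter logistic model."""
    s = psi(theta, a, b)
    p = c + (1 - c) * s
    return (1 - c) * (D * a) ** 2 * s * s * (1 - s) / p

def lord_bias(theta, items):
    """Lord's MLE bias function (3.12)."""
    total_info = sum(item_info_3pl(theta, a, b, c) for a, b, c in items)
    weighted = sum(a * item_info_3pl(theta, a, b, c) * (psi(theta, a, b) - 0.5)
                   for a, b, c in items)
    return D * weighted / total_info ** 2

# Hypothetical test: difficulties spread from -2 to 2, common a_g and c_g.
ITEMS = [(1.0, k / 2.0, 0.2) for k in range(-4, 5)]

for theta in (-3.0, 0.0, 3.0):
    print(theta, lord_bias(theta, ITEMS))
```

With difficulties spread over $(-2, 2)$ , the computed bias should be negative well below that range, positive well above it, and comparatively small in between.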

In the general case of discrete item responses, we obtain for the bias function of the maximum
likelihood estimate (cf. Samejima, 1987)

(3.14)  $B(\hat\theta_V \mid \theta) = E[\hat\theta_V - \theta \mid \theta] = -(1/2)[I(\theta)]^{-2} \sum_{g=1}^{n} \sum_{k_g} A_{k_g}(\theta) P_{k_g}''(\theta)$ ,

where $A_{k_g}(\theta)$ is the basic function for the discrete item response $k_g$ , and $P_{k_g}'(\theta)$ and
$P_{k_g}''(\theta)$ denote the first and second partial derivatives of $P_{k_g}(\theta)$ with respect to
$\theta$ , respectively. On the graded response level where item score $x_g$ assumes successive integers,
0 through $m_g$ , each $k_g$ in the above formula must be replaced by the graded item score $x_g$
(cf. Samejima, 1969, 1972). On the dichotomous response level, it can be reduced to the form

(3.15)  $B(\hat\theta_V \mid \theta) = E[\hat\theta_V - \theta \mid \theta] = (-1/2)[I(\theta)]^{-2} \sum_{g=1}^{n} I_g(\theta) P_g''(\theta) [P_g'(\theta)]^{-1}$ ,

with $P_g'(\theta)$ and $P_g''(\theta)$ indicating the first and second partial derivatives of $P_g(\theta)$ with
respect to $\theta$ , respectively. This formula includes Lord’s bias function in the three-parameter
logistic model as a special case.
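
That (3.15) contains Lord's formula (3.12) as a special case can be checked numerically. In the sketch below, (3.15) is evaluated for three-parameter logistic items with finite-difference derivatives and compared with the closed form (3.12). The three items and all names are hypothetical, and the step size is a convenience choice rather than anything prescribed in the report.

```python
import math

D = 1.7
ITEMS = [(1.0, -1.0, 0.2), (1.3, 0.0, 0.25), (0.8, 1.0, 0.15)]  # hypothetical (a_g, b_g, c_g)

def p3pl(theta, a, b, c):
    """Three-parameter logistic operating characteristic (3.11)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def derivs(f, x, h=1e-3):
    """Central-difference first and second derivatives of f at x."""
    f0, fp, fm = f(x), f(x + h), f(x - h)
    return (fp - fm) / (2 * h), (fp - 2 * f0 + fm) / (h * h)

def bias_general(theta, items):
    """Formula (3.15): -(1/2) I^-2 * sum_g I_g P''_g / P'_g."""
    info = acc = 0.0
    for a, b, c in items:
        p = p3pl(theta, a, b, c)
        d1, d2 = derivs(lambda t: p3pl(t, a, b, c), theta)
        i_g = d1 * d1 / (p * (1 - p))
        info += i_g
        acc += i_g * d2 / d1
    return -0.5 * acc / info ** 2

def bias_lord(theta, items):
    """Formula (3.12): D I^-2 * sum_g a_g I_g (psi_g - 1/2)."""
    info = acc = 0.0
    for a, b, c in items:
        s = 1 / (1 + math.exp(-D * a * (theta - b)))
        i_g = (1 - c) * (D * a) ** 2 * s * s * (1 - s) / (c + (1 - c) * s)
        info += i_g
        acc += a * i_g * (s - 0.5)
    return D * acc / info ** 2

for theta in (-1.5, 0.0, 1.5):
    print(theta, bias_general(theta, ITEMS), bias_lord(theta, ITEMS))
```

The two columns should agree to within the finite-difference error, since $P_g''(\theta)[P_g'(\theta)]^{-1} = D a_g [1 - 2\psi_g(\theta)]$ in this model.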

We can rewrite the inequality (3.10) for the maximum likelihood estimate $\hat\theta_V$ as

(3.16)  $\mathrm{Var.}(\hat\theta_V \mid \theta) \ge [1 + \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)]^2 [I(\theta)]^{-1}$ .

Taking the reciprocal of the right hand side of (3.16), which is an approximate minimum variance bound
of the maximum likelihood estimator, a modified test information function, $\Upsilon(\theta)$ , is proposed by

(3.17)  $\Upsilon(\theta) = I(\theta) [1 + \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)]^{-2}$ .

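From this formula, we can see that the relationship between this new function and the original test
information function depends upon the first derivative of the MLE bias function. If the derivative is
positive, then the new function will assume a lesser value than the original test information function;
if it is negative, then this relationship will be reversed; if it is zero, as is the case when the MLE is
unbiased, then these two functions will assume the same value. We can write from (3.14) for the general
form of the derivative of the MLE bias function

(3.18)  $\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta) = \{I(\theta)\}^{-1} [ (1/2)\{I(\theta)\}^{-1} \sum_{g=1}^{n} \sum_{k_g} \{ I_{k_g}(\theta) P_{k_g}''(\theta) - P_{k_g}'(\theta) P_{k_g}'''(\theta) \{P_{k_g}(\theta)\}^{-1} \}$
        $- 2 B(\hat\theta_V \mid \theta) I'(\theta) ]$ ,

where $P_{k_g}'''(\theta)$ and $I'(\theta)$ denote the third and the first derivatives of $P_{k_g}(\theta)$
and $I(\theta)$ with respect to $\theta$ , respectively. It is obvious from (2.3) and (2.8) that we have

(3.19)  $I_g'(\theta) = \sum_{k_g} P_{k_g}'(\theta) [ P_{k_g}''(\theta) \{P_{k_g}(\theta)\}^{-1} - I_{k_g}(\theta) ]$

and

(3.20)  $I'(\theta) = \sum_{g=1}^{n} I_g'(\theta) = \sum_{g=1}^{n} \sum_{k_g} P_{k_g}'(\theta) [ P_{k_g}''(\theta) \{P_{k_g}(\theta)\}^{-1} - I_{k_g}(\theta) ]$ ,

where $I_g'(\theta)$ is the first derivative of the item information function $I_g(\theta)$ with respect to
$\theta$ . For a set of dichotomous items (3.18) becomes simplified into the form

(3.21)  $\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta) = \{I(\theta)\}^{-1} [ (1/2)\{I(\theta)\}^{-1} \sum_{g=1}^{n} \{P_g(\theta)\}^{-2} \{1 - P_g(\theta)\}^{-2}$
        $( \{1 - 2P_g(\theta)\} \{P_g'(\theta)\}^2 P_g''(\theta) - P_g(\theta) \{1 - P_g(\theta)\} ( \{P_g''(\theta)\}^2 + P_g'(\theta) P_g'''(\theta) ) )$
        $- 2 B(\hat\theta_V \mid \theta) I'(\theta) ]$ ,

where $B(\hat\theta_V \mid \theta)$ is given by (3.15).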


[III.3]  Minimum Bound of the Mean Squared Error

When the estimator $\theta_V^*$ is conditionally biased, the conditional variance, however small it may
be, does not reflect the accuracy of estimation of $\theta$ . Thus the mean squared error,
$E[(\theta_V^* - \theta)^2 \mid \theta]$ , becomes a more important indicator of the accuracy. We can write
for the mean squared error

(3.22)  $E[(\theta_V^* - \theta)^2 \mid \theta] = \mathrm{Var.}(\theta_V^* \mid \theta) + [E(\theta_V^* \mid \theta) - \theta]^2$

(cf. Kendall and Stuart, 1961). We can see in this formula that the mean squared error equals the
conditional variance if $\theta_V^*$ is unbiased, and is greater than the variance when $\theta_V^*$ is
biased. From this and the inequality (3.10) we obtain for the minimum bound of the mean squared error

(3.23)  $E[(\theta_V^* - \theta)^2 \mid \theta] \ge [1 + \frac{\partial}{\partial\theta} E(\theta_V^* - \theta \mid \theta)]^2 [I(\theta)]^{-1} + [E(\theta_V^* \mid \theta) - \theta]^2$ .

Note that this inequality holds for any estimator, $\theta_V^*$ , of $\theta$ .

[III.4]  Second  Modified  Test  Information  Function 

For the maximum likelihood estimate $\hat\theta_V$ , we can rewrite the inequality (3.23) by using the MLE
bias function, which is given by (3.14), to obtain

(3.24)  $E[(\hat\theta_V - \theta)^2 \mid \theta] \ge [1 + \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)]^2 [I(\theta)]^{-1} + [B(\hat\theta_V \mid \theta)]^2$ .

Taking the reciprocal of the right hand side of (3.24), which is an approximate minimum bound of
the mean squared error of the maximum likelihood estimator, the second modified test information
function, $\Xi(\theta)$ , is proposed by

(3.25)  $\Xi(\theta) = I(\theta) \{ [1 + \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)]^2 + I(\theta) [B(\hat\theta_V \mid \theta)]^2 \}^{-1}$ .

We can see that the difference between the two modification formulae of the test information function,
which are defined by (3.17) and (3.25), respectively, is the second and last term in the braces of the
right hand side of the formula (3.25). Since this term is nonnegative, there is a relationship

(3.26)  $\Xi(\theta) \le \Upsilon(\theta)$ ,

throughout the whole range of $\theta$ , regardless of the slope of the MLE bias function. If there is a
range of $\theta$ where the maximum likelihood estimate is unbiased, then we will have for that range of
$\theta$

(3.27)  $\Xi(\theta) = \Upsilon(\theta) = I(\theta)$ .

Since under a general condition the maximum likelihood estimator $\hat\theta_V$ is asymptotically unbiased,
as the number of items approaches positive infinity, (3.27) holds asymptotically for all $\theta$ .
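
The two modification formulae (3.17) and (3.25) can be evaluated side by side. The following minimal sketch assumes a hypothetical set of twenty two-parameter logistic items with $D = 1.7$, takes the bias from (3.12) with $c_g = 0$ , and approximates the derivative of the bias function by a central difference instead of the analytic expression (3.21). All names are illustrative.

```python
import math

D = 1.7
# Hypothetical test: 20 items, difficulties symmetric about zero.
ITEMS = [(1.0, -1.5), (1.0, -0.5), (1.0, 0.0), (1.0, 0.5), (1.0, 1.5)] * 4

def info_and_bias(theta):
    """Test information and MLE bias (3.12) for logistic items with c_g = 0."""
    info = acc = 0.0
    for a, b in ITEMS:
        s = 1 / (1 + math.exp(-D * a * (theta - b)))
        i_g = (D * a) ** 2 * s * (1 - s)
        info += i_g
        acc += a * i_g * (s - 0.5)
    return info, D * acc / info ** 2

def modified_information(theta, h=1e-4):
    """Return I(theta), Upsilon(theta) of (3.17) and Xi(theta) of (3.25)."""
    info, bias = info_and_bias(theta)
    slope = (info_and_bias(theta + h)[1] - info_and_bias(theta - h)[1]) / (2 * h)
    upsilon = info / (1 + slope) ** 2
    xi = info / ((1 + slope) ** 2 + info * bias ** 2)
    return info, upsilon, xi

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(theta, modified_information(theta))
```

The inequality (3.26) should hold at every point, and the two modified functions should nearly coincide wherever the bias is close to zero.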


[III.5]  Examples

Samejima has applied formula (3.15) for the MLE bias functions of the Iowa Level 11 Vocabulary
Subtest and Shiba’s Test J1 of Word/Phrase Comprehension, based upon the set of data collected for
2,356 and 2,259 subjects, respectively. These tests have forty-three and fifty-five dichotomously
scored items, respectively, and, following the normal ogive model, whose operating characteristic for the
correct answer is given by

(3.28)  $P_g(\theta) = [2\pi]^{-1/2} \int_{-\infty}^{a_g(\theta - b_g)} e^{-u^2/2} \, du$ ,

the discrimination and difficulty parameters were estimated (Samejima, 1984a, 1984b). The resulting
MLE bias functions are illustrated in Figure 3-1. We can see that in each of these two examples there
is a wide range of $\theta$ , i.e., approximately (-2.0, 1.5), for which the maximum likelihood estimate of
$\theta$ is practically unbiased. The amount of bias is especially small for Shiba’s Test J1. Although this
feature indicates good qualities of these tests, we still have to expect some biases when these tests are
administered to groups of examinees whose ability distributes on the relatively lower side or on the
relatively higher side of the ability scale.

FIGURE 3-1

MLE Bias Functions of the Iowa Level 11 Vocabulary Subtest (Solid Line) and of Shiba’s
Test J1 of Word/Phrase Comprehension (Dashed Line), Following the Normal Ogive Model.
(Horizontal axis: theta.)

When the MLE bias function of the test is monotone increasing, as are those illustrated in Figure 3-1,
it is obvious from (3.17) that $\Upsilon(\theta)$ will assume lesser values than those of the original test
information function $I(\theta)$ for lower and higher levels of $\theta$ , while these two functions are
practically identical in between. The same applies to $\Xi(\theta)$ , and we have the relationship,

(3.29)  $\Xi(\theta) \le \Upsilon(\theta) \le I(\theta)$ ,

throughout the whole range of $\theta$ .

In the normal ogive model, differentiating (3.28) twice with respect to $\theta$ and rearranging, we obtain

(3.30)  $P_g'(\theta) = [2\pi]^{-1/2} a_g \exp[-(1/2) a_g^2 (\theta - b_g)^2]$

and

(3.31)  $P_g''(\theta) = -a_g^2 (\theta - b_g) P_g'(\theta)$ .


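Substituting (3.30) and (3.31) into (3.15) and rearranging, we can write for the MLE bias function
following the normal ogive model on the dichotomous response level

(3.32)  $B(\hat\theta_V \mid \theta) = (1/2) [I(\theta)]^{-2} \sum_{g=1}^{n} a_g^2 (\theta - b_g) I_g(\theta)$ .

Differentiating (3.32) with respect to $\theta$ , we obtain

(3.33)  $\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta) = [I(\theta)]^{-2} [ (1/2) \sum_{g=1}^{n} a_g^2 \{ I_g'(\theta)(\theta - b_g) + I_g(\theta) \}$
        $- [I(\theta)]^{-1} I'(\theta) \sum_{g=1}^{n} a_g^2 (\theta - b_g) I_g(\theta) ]$ .

It is obvious from (2.4), (2.8) and (3.31) that we have

(3.34)  $I_g'(\theta) = I_g(\theta) [ P_g'(\theta) \{2P_g(\theta) - 1\} (P_g(\theta)\{1 - P_g(\theta)\})^{-1} - 2a_g^2(\theta - b_g) ]$

and

(3.35)  $I'(\theta) = \sum_{g=1}^{n} I_g(\theta) [ P_g'(\theta) \{2P_g(\theta) - 1\} (P_g(\theta)\{1 - P_g(\theta)\})^{-1} - 2a_g^2(\theta - b_g) ]$ .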
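
The normal ogive formulae can be cross-checked against each other. In the sketch below the bias (3.32) and its analytic derivative (3.33), built from (3.34) and (3.35), are computed for a hypothetical set of twenty items, so that the analytic slope can be compared with a finite-difference slope of (3.32). The item parameters and function names are assumptions made for illustration.

```python
import math

ITEMS = [(1.0, -1.0), (0.8, -0.3), (1.2, 0.2), (1.0, 0.9)] * 5  # hypothetical (a_g, b_g)

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def item_info_and_slope(theta, a, b):
    """Item information and its first derivative (3.34) in the normal ogive model."""
    p = Phi(a * (theta - b))                 # (3.28)
    d1 = a * phi(a * (theta - b))            # (3.30)
    i_g = d1 * d1 / (p * (1 - p))
    di_g = i_g * (d1 * (2 * p - 1) / (p * (1 - p)) - 2 * a * a * (theta - b))
    return i_g, di_g

def bias(theta):
    """MLE bias function (3.32)."""
    terms = [(a, b) + item_info_and_slope(theta, a, b) for a, b in ITEMS]
    info = sum(t[2] for t in terms)
    return 0.5 * sum(a * a * (theta - b) * i_g for a, b, i_g, _ in terms) / info ** 2

def bias_slope(theta):
    """Analytic derivative (3.33) of the MLE bias function."""
    terms = [(a, b) + item_info_and_slope(theta, a, b) for a, b in ITEMS]
    info = sum(t[2] for t in terms)
    d_info = sum(t[3] for t in terms)        # (3.35)
    s1 = sum(a * a * (di * (theta - b) + i_g) for a, b, i_g, di in terms)
    s2 = sum(a * a * (theta - b) * i_g for a, b, i_g, di in terms)
    return (0.5 * s1 - d_info * s2 / info) / info ** 2

h = 1e-4
for theta in (-1.0, 0.0, 1.0):
    numeric = (bias(theta + h) - bias(theta - h)) / (2 * h)
    print(theta, bias_slope(theta), numeric)
```

Agreement between the two slopes is a simple consistency check on (3.32) through (3.35).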

Figure 3-2 shows the square roots of the original and the two modified test information functions
for the Iowa Level 11 Vocabulary Subtest and for Shiba’s Test J1 of Word/Phrase Comprehension,
following the normal ogive model. In each of these figures, the curves representing the results of the
two modification formulae assume lower values than the square root of the original test information
function for all $\theta$ , as was expected from the shape of the MLE bias function in Figure 3-1. The
discrepancies between the results of the two modification formulae are small, however, in each figure.

FIGURE 3-2

Square Roots of the Original (Solid Line) and the Two Modified (Dashed and Dotted Lines)
Test Information Functions of the Iowa Level 11 Vocabulary Subtest, and Those of Shiba’s
Test J1 of Word/Phrase Comprehension, Following the Normal Ogive Model.
(Vertical axis: square root of test information; horizontal axis: theta.)

In the three-parameter logistic model, the operating characteristic of the correct answer is given by
the formula (3.11), and Lord’s MLE bias function for the three-parameter logistic model, which is given
by (3.12), is readily applicable. Differentiating (3.11) three times with respect to $\theta$ and
rearranging, we can write

(3.36)  $P_g'(\theta) = (1 - c_g) D a_g \psi_g(\theta) [1 - \psi_g(\theta)]$ ,

(3.37)  $P_g''(\theta) = (1 - c_g) D^2 a_g^2 \psi_g(\theta) [1 - \psi_g(\theta)] [1 - 2\psi_g(\theta)] = D a_g P_g'(\theta) [1 - 2\psi_g(\theta)]$

and

(3.38)  $P_g'''(\theta) = D^2 a_g^2 P_g'(\theta) [1 - 6\psi_g(\theta) + 6\{\psi_g(\theta)\}^2]$ ,

where $\psi_g(\theta)$ is defined by (3.13). Substituting (3.36) into (2.4) and rearranging, we obtain for the
item information function

(3.39)  $I_g(\theta) = (1 - c_g) D^2 a_g^2 \{\psi_g(\theta)\}^2 [1 - \psi_g(\theta)] [c_g + (1 - c_g) \psi_g(\theta)]^{-1}$ .

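This and (2.8) will enable us to evaluate Lord’s MLE bias function given by (3.12). Differentiating
(3.12) with respect to $\theta$ and rearranging, we can write

(3.40)  $\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta) = D \{I(\theta)\}^{-2} [ \sum_{g=1}^{n} a_g I_g'(\theta) \{\psi_g(\theta) - (1/2)\}$
        $+ D \sum_{g=1}^{n} a_g^2 I_g(\theta) \psi_g(\theta) \{1 - \psi_g(\theta)\}$
        $- 2 I'(\theta) \{I(\theta)\}^{-1} \sum_{g=1}^{n} a_g I_g(\theta) \{\psi_g(\theta) - (1/2)\} ]$ .

We also obtain from (2.4), (3.11) and (2.8) the first derivatives of the item and the test information
functions with respect to $\theta$ so that we have

(3.41)  $I_g'(\theta) = (1 - c_g) D^3 a_g^3 \{\psi_g(\theta)\}^2 [1 - \psi_g(\theta)] \{P_g(\theta)\}^{-1}$
        $[ 2 - 3\psi_g(\theta) - (1 - c_g) \psi_g(\theta) \{1 - \psi_g(\theta)\} \{P_g(\theta)\}^{-1} ]$
        $= D a_g I_g(\theta) [ 2\{1 - \psi_g(\theta)\} - \psi_g(\theta) \{P_g(\theta)\}^{-1} ]$

and

(3.42)  $I'(\theta) = D \sum_{g=1}^{n} a_g I_g(\theta) [ 2\{1 - \psi_g(\theta)\} - \psi_g(\theta) \{P_g(\theta)\}^{-1} ]$ ,

and we can use these two results in (3.40) in order to evaluate $\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)$ .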

When $c_g = 0$ , i.e., for the original logistic model on the dichotomous response level, these formulae
become much more simplified, and we can write

(3.43)  $P_g(\theta) = [1 + \exp\{-D a_g(\theta - b_g)\}]^{-1} = \psi_g(\theta)$ ,

(3.44)  $P_g'(\theta) = D a_g \psi_g(\theta) [1 - \psi_g(\theta)]$ ,

(3.45)  $P_g''(\theta) = D^2 a_g^2 \psi_g(\theta) [1 - \psi_g(\theta)] [1 - 2\psi_g(\theta)] = D a_g P_g'(\theta) [1 - 2\psi_g(\theta)]$ ,

(3.46)  $P_g'''(\theta) = D^3 a_g^3 \psi_g(\theta) [1 - \psi_g(\theta)] [1 - 6\psi_g(\theta) + 6\{\psi_g(\theta)\}^2]$ ,

(3.47)  $I_g(\theta) = D^2 a_g^2 \psi_g(\theta) [1 - \psi_g(\theta)]$ ,

(3.48)  $I_g'(\theta) = D^3 a_g^3 \psi_g(\theta) [1 - \psi_g(\theta)] [1 - 2\psi_g(\theta)] = D a_g I_g(\theta) [1 - 2\psi_g(\theta)]$ ,

(3.49)  $I(\theta) = D^2 \sum_{g=1}^{n} a_g^2 \psi_g(\theta) [1 - \psi_g(\theta)]$

and

(3.50)  $I'(\theta) = D \sum_{g=1}^{n} a_g I_g(\theta) [1 - 2\psi_g(\theta)]$ ,

respectively. Thus the two modified test information functions, $\Upsilon(\theta)$ and $\Xi(\theta)$ , which
are defined by (3.17) and (3.25), can be evaluated accordingly, both for the original logistic model and
for the three-parameter logistic model.

FIGURE 3-3

MLE Bias Functions of the Hypothetical Test of Thirty-Five Graded Test Items Following
the Normal Ogive Model (Solid Line) and the Logistic Model (Dashed Line).
(Horizontal axis: theta.)

The reader is directed to ONR/RR-90-1 (cf. Samejima, 1990) for the MLE bias functions and
the square roots of the original and the two modified test information functions of the Iowa Level 11
Vocabulary Subtest and of Shiba’s Test J1 of Word/Phrase Comprehension, following the logistic model
by using the same sets of estimated item parameters and by setting $D = 1.7$ . These results are similar
to those following the normal ogive model, which are presented by Figures 3-1 and 3-2, except that
the square roots of the original and the modified test information functions are a little steeper, a
characteristic of the logistic model in comparison with the normal ogive model.

In the homogeneous case of the graded response level (Samejima, 1969, 1972), the general formula
for the operating characteristic of the item score $x_g$ ($= 0, 1, \ldots, m_g$) is given by

(3.51)  $P_{x_g}(\theta) = P_{x_g}^*(\theta) - P_{x_g+1}^*(\theta)$ ,

where

(3.52)  $P_{x_g}^*(\theta) = \int_{-\infty}^{a_g(\theta - b_{x_g})} \phi_g(t) \, dt$ ,

(3.53)  $-\infty = b_0 < b_1 < b_2 < \ldots < b_{m_g} < b_{m_g+1} = \infty$

and $\phi_g(t)$ is some specified density function. When we replace the right hand side of (3.52) by that of
(3.28) with $b_g$ replaced by $b_{x_g}$ and use the result in (3.51), we have the operating characteristic of
$x_g$ in the normal ogive model on the graded response level; when we do the same thing using the right
hand side of (3.13), we obtain the operating characteristic of $x_g$ in the logistic model on the graded
response level.
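
A minimal sketch of (3.51) through (3.53) follows, assuming either the normal ogive or the logistic form for the cumulative function $P_{x_g}^*(\theta)$ . The function name, its arguments, and the parameter values in the usage lines are hypothetical; the thresholds passed in are $b_1, \ldots, b_{m_g}$ in increasing order, and the conventions $b_0 = -\infty$ and $b_{m_g+1} = \infty$ are handled by fixing the outermost cumulative values at 1 and 0.

```python
import math

def graded_probs(theta, a, thresholds, model="normal", D=1.7):
    """Operating characteristics P_x(theta), x = 0..m, from (3.51)-(3.53).

    thresholds: increasing list [b_1, ..., b_m]; model is "normal" or "logistic".
    """
    def cumulative(b):
        z = a * (theta - b)
        if model == "normal":
            return 0.5 * (1 + math.erf(z / math.sqrt(2)))   # as in (3.28)
        return 1 / (1 + math.exp(-D * z))                   # as in (3.13)

    # P*_0 = 1 (b_0 = -infinity) and P*_{m+1} = 0 (b_{m+1} = +infinity)
    stars = [1.0] + [cumulative(b) for b in thresholds] + [0.0]
    return [stars[x] - stars[x + 1] for x in range(len(thresholds) + 1)]

# Hypothetical item with three graded score categories (m_g = 2).
normal_probs = graded_probs(0.3, 1.2, [-0.8, 0.9], model="normal")
logistic_probs = graded_probs(0.3, 1.2, [-0.8, 0.9], model="logistic")
print(normal_probs, logistic_probs)
```

Because the differences in (3.51) telescope, the category probabilities sum to one for any $\theta$ .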

A hypothetical test of thirty-five graded items, with three graded score categories each, which gives
an approximately constant amount of test information for the interval of $\theta$ , (-3, 3), has been used
repeatedly in the author’s research (cf. Samejima, 1981, 1988). Figure 3-3 presents the MLE bias
functions for this hypothetical test, following the normal ogive model and the logistic model on the
graded response level, respectively. We can see that a practical unbiasedness holds for a very wide
range of $\theta$ in both cases, as is expected for a set of graded test items whose response difficulty
levels are widely distributed, an advantage of graded responses over dichotomous responses. We also notice
that these two MLE bias functions are almost indistinguishable from each other. Figure 3-4 presents the
square roots of the original and the two modified test information functions of this hypothetical test of
graded items, following the normal ogive model and the logistic model. As is expected, the differences
among the three functions are small for a wide range of $\theta$ in both cases. It is interesting to note,
however, that in these figures the square roots of the modified test information functions assume higher
values than the square root of the original test information function at certain points of $\theta$ , and
this tendency is especially conspicuous in the results of the logistic model. This comes from the fact that
the MLE bias functions, which are presented in Figure 3-3 for both models, have tiny ups and downs,
and they are not strictly increasing in $\theta$ .

In each of the examples given above, the difficulty parameters of the items in each test distribute
widely over the range of $\theta$ of interest, and this fact is the main reason that the MLE bias function
assumes relatively small values for a wide range of $\theta$ . We also notice that the resulting two modified
test information functions are reasonably close to the original test information function.

For the sake of comparison, Figure 3-5 presents the MLE bias function and the square roots of the
original and the two modified test information functions, for a hypothetical test of thirty equivalent,
dichotomous items with the common item parameters, $a_g = 1.0$ and $b_g = 0.0$ , following the logistic
model. We can see in the first graph of Figure 3-5 that the amount of bias increases rapidly outside
the range of $\theta$ , (-1.0, 1.0) . The resulting square roots of the two modified test information functions
demonstrate substantially large decrements from the original $[I(\theta)]^{1/2}$ outside this interval of
$\theta$ , as we can see in the second graph of Figure 3-5.

We also notice that in all these examples there are not substantial differences between the results
of the two modification formulae. This indicates that in these examples it does not make so much
difference if we choose Modification Formula No. 1 or Modification Formula No. 2. We should not
generalize this conclusion to other situations, however, until we have tried these modification formulae
on different types of data sets.

FIGURE 3-4

Square Roots of the Original (Solid Line) and the Two Modified (Dashed and Dotted Lines)
Test Information Functions of the Hypothetical Test of Thirty-Five Graded Test Items
Following the Normal Ogive Model and the Logistic Model, Respectively.
(Vertical axis: square root of test information.)

FIGURE 3-5

MLE Bias Function of the Hypothetical Test of Thirty Equivalent Test Items Following
the Logistic Model with $a_g = 1.0$ and $b_g = 0.0$ As the Common Parameters (Above),
and Square Roots of the Original (Solid Line) and the Two Modified (Dashed and
Dotted Lines) Test Information Functions of the Same Test (Below).
(Horizontal axis: theta, from -5.0 to 5.0; vertical axis of the upper graph: MLE bias function.)


[III.6]  Minimum  Bounds  of  Variance  and  Mean  Squared  Error  for  the 
Transformed  Latent  Variable 

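Since most psychological scales, including those in latent trait models, are subject to monotone
transformation, we need to consider information functions that are based upon the transformed latent
variable. Let $\tau$ denote a transformed latent variable, i.e.,

(3.54)  $\tau = \tau(\theta)$ .

We assume that $\tau$ is strictly increasing in, and three times differentiable with respect to, $\theta$ ,
and vice versa. We have for the operating characteristic, $P_{k_g}^*(\tau)$ , of the discrete item response
$k_g$ , which is defined as a function of $\tau$ ,

(3.55)  $P_{k_g}^*(\tau) = \mathrm{prob.}[k_g \mid \tau] = \mathrm{prob.}[k_g \mid \theta] = P_{k_g}(\theta)$ ,

and by local independence we can write for the operating characteristic of the response pattern,
$P_V^*(\tau)$ ,

(3.56)  $P_V^*(\tau) = \prod_{k_g \in V} P_{k_g}^*(\tau) = \prod_{k_g \in V} P_{k_g}(\theta) = P_V(\theta)$ .

As before, the item response information function, $I_{k_g}^*(\tau)$ , is defined by

(3.57)  $I_{k_g}^*(\tau) = -\frac{\partial^2}{\partial\tau^2} \log P_{k_g}^*(\tau)$ ,

and for the item information function, $I_g^*(\tau)$ , and the test information function, $I^*(\tau)$ , we
can write from (3.57), (2.3) and (2.8)

(3.58)  $I_g^*(\tau) = \sum_{k_g} I_{k_g}^*(\tau) P_{k_g}^*(\tau) = \sum_{k_g} [\frac{\partial}{\partial\tau} P_{k_g}^*(\tau)]^2 [P_{k_g}^*(\tau)]^{-1}$
        $= \sum_{k_g} [P_{k_g}'(\theta)]^2 [P_{k_g}(\theta)]^{-1} [\frac{\partial\theta}{\partial\tau}]^2 = I_g(\theta) [\frac{\partial\theta}{\partial\tau}]^2$

and

(3.59)  $I^*(\tau) = \sum_{g=1}^{n} I_g^*(\tau) = I(\theta) [\frac{\partial\theta}{\partial\tau}]^2$ ,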

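respectively. Let $\tau_V^*$ be any estimator of $\tau$ , which may be biased or unbiased. In general, we can
write

(3.60)  $E(\tau_V^* \mid \tau) = \tau + E(\tau_V^* - \tau \mid \tau)$ ,

and, differentiating (3.60) with respect to $\theta$ , we obtain

(3.61)  $\frac{\partial}{\partial\theta} E(\tau_V^* \mid \tau) = \frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} E(\tau_V^* - \tau \mid \tau)$ .

Since from (3.56) we can also write for $E(\tau_V^* \mid \tau)$

(3.62)  $E(\tau_V^* \mid \tau) = \sum_V \tau_V^* P_V^*(\tau) = \sum_V \tau_V^* P_V(\theta)$ ,

differentiating (3.62) with respect to $\theta$ and following a logic similar to that used in Section 3.1, we
obtain

(3.63)  $\frac{\partial}{\partial\theta} E(\tau_V^* \mid \tau) = \sum_V \tau_V^* [\frac{\partial}{\partial\theta} P_V(\theta)] = \sum_V [\tau_V^* - E(\tau_V^* \mid \tau)] [\frac{\partial}{\partial\theta} P_V(\theta)]$
        $= \sum_V [\tau_V^* - E(\tau_V^* \mid \tau)] [\frac{\partial}{\partial\theta} \log P_V(\theta)] P_V(\theta)$ .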

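By the Cramér-Rao inequality, we can write

(3.64)  $[\frac{\partial}{\partial\theta} E(\tau_V^* \mid \tau)]^2 \le \mathrm{Var.}(\tau_V^* \mid \tau) \, E[\{\frac{\partial}{\partial\theta} \log P_V(\theta)\}^2 \mid \theta]$ ,

and from this, (2.7), (2.8), (3.10) and (3.61) we obtain

(3.65)  $\mathrm{Var.}(\tau_V^* \mid \tau) \ge [\frac{\partial}{\partial\theta} E(\tau_V^* \mid \tau)]^2 [I(\theta)]^{-1} = [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} E(\tau_V^* - \tau \mid \tau)]^2 [I(\theta)]^{-1}$ .

Thus the rightmost side of (3.65) provides us with the minimum variance bound of any estimator of
$\tau$ . When $\tau_V^*$ is an unbiased estimator of $\tau$ , the second term of the first factor of the
rightmost side of (3.65) equals zero, and by virtue of (3.59) the inequality is reduced to

(3.66)  $\mathrm{Var.}(\tau_V^* \mid \tau) \ge [\frac{\partial\tau}{\partial\theta}]^2 [I(\theta)]^{-1} = [I^*(\tau)]^{-1}$ .

For the mean squared error, $E[(\tau_V^* - \tau)^2 \mid \tau]$ , we can write

(3.67)  $E[(\tau_V^* - \tau)^2 \mid \tau] = \mathrm{Var.}(\tau_V^* \mid \tau) + [E(\tau_V^* \mid \tau) - \tau]^2$ ,

and from this and (3.65) we obtain

(3.68)  $E[(\tau_V^* - \tau)^2 \mid \tau] \ge [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} E(\tau_V^* - \tau \mid \tau)]^2 [I(\theta)]^{-1} + [E(\tau_V^* \mid \tau) - \tau]^2$ .
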


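[III.7]  Modified Test Information Functions Based upon the Transformed
Latent Variable

The maximum likelihood estimator, $\hat\tau_V$ , of $\tau$ , can be obtained by the direct transformation of
the maximum likelihood estimate, $\hat\theta_V$ , of $\theta$ , i.e.,

(3.69)  $\hat\tau_V = \tau(\hat\theta_V)$ .

Let $B^*(\hat\tau_V \mid \tau)$ be the MLE bias function defined for the transformed latent variable
$\tau$ , i.e.,

(3.70)  $B^*(\hat\tau_V \mid \tau) = E(\hat\tau_V - \tau \mid \tau)$ .

From this, (3.65) and (3.68) we obtain

(3.71)  $\mathrm{Var.}(\hat\tau_V \mid \tau) \ge [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)]^2 [I(\theta)]^{-1}$

and

(3.72)  $E[(\hat\tau_V - \tau)^2 \mid \tau] \ge [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)]^2 [I(\theta)]^{-1} + [B^*(\hat\tau_V \mid \tau)]^2$ .

The reciprocals of the right hand sides of the above two inequalities provide us with the two modified
test information functions for the transformed latent variable $\tau$ , i.e.,

(3.73)  $\Upsilon^*(\tau) = I(\theta) [\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)]^{-2}$

and

(3.74)  $\Xi^*(\tau) = I(\theta) [ \{\frac{\partial\tau}{\partial\theta} + \frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)\}^2 + I(\theta) \{B^*(\hat\tau_V \mid \tau)\}^2 ]^{-1}$ .

In the general case of discrete item responses we can write for the MLE bias function
$B^*(\hat\tau_V \mid \tau)$ and its derivative with respect to $\theta$

(3.75)  $B^*(\hat\tau_V \mid \tau) = B(\hat\theta_V \mid \theta) [\frac{\partial\theta}{\partial\tau}]^{-1} - (1/2) [I(\theta)]^{-1} [\frac{\partial\theta}{\partial\tau}]^{-3} \frac{\partial^2\theta}{\partial\tau^2}$
        $= B(\hat\theta_V \mid \theta) \frac{\partial\tau}{\partial\theta} + (1/2) [I(\theta)]^{-1} \frac{\partial^2\tau}{\partial\theta^2}$

and

(3.76)  $\frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau) = [\frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)] \frac{\partial\tau}{\partial\theta} + B(\hat\theta_V \mid \theta) \frac{\partial^2\tau}{\partial\theta^2}$
        $+ (1/2) [I(\theta)]^{-2} [ I(\theta) \frac{\partial^3\tau}{\partial\theta^3} - I'(\theta) \frac{\partial^2\tau}{\partial\theta^2} ]$ ,

respectively (cf. Samejima, 1987). Thus we can use (3.75) and (3.76) in evaluating the modified test
information functions, $\Upsilon^*(\tau)$ and $\Xi^*(\tau)$ , which are given by (3.73) and (3.74).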
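
As a worked illustration of (3.59), (3.75) and (3.76), consider the transformation $\tau = e^{\theta}$ . This particular choice is an assumption made here for illustration and does not come from the report; it is strictly increasing and differentiable any number of times, so it satisfies the conditions stated for (3.54).

```latex
% Illustrative transformation (an assumption, not from the report): \tau = e^{\theta}
\begin{align*}
\frac{\partial\tau}{\partial\theta}
  = \frac{\partial^2\tau}{\partial\theta^2}
  = \frac{\partial^3\tau}{\partial\theta^3} &= \tau , \\
B^*(\hat\tau_V \mid \tau)
  &= \tau \left[ B(\hat\theta_V \mid \theta) + (1/2)[I(\theta)]^{-1} \right] , \\
\frac{\partial}{\partial\theta} B^*(\hat\tau_V \mid \tau)
  &= \tau \left[ \frac{\partial}{\partial\theta} B(\hat\theta_V \mid \theta)
     + B(\hat\theta_V \mid \theta)
     + (1/2)[I(\theta)]^{-2} \{ I(\theta) - I'(\theta) \} \right] , \\
I^*(\tau) &= I(\theta)\, \tau^{-2} .
\end{align*}
```

The first form of (3.75) gives the same bias: with $\partial\theta/\partial\tau = \tau^{-1}$ and $\partial^2\theta/\partial\tau^2 = -\tau^{-2}$ , it becomes $B(\hat\theta_V \mid \theta)\,\tau + (1/2)[I(\theta)]^{-1}\tau$ . Note that $\hat\tau_V$ is biased even where $\hat\theta_V$ is unbiased, because of the term $(1/2)[I(\theta)]^{-1}\tau$ .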

[III.8]  Discussion and Conclusions

A minimum variance bound of any estimator, biased or unbiased, has been considered, and, based on that,
Modification Formula No. 1 has been proposed for the maximum likelihood estimator, in place of the
test information function. A minimum bound of the mean squared error of any estimator has also been
considered, and, based on that, Modification Formula No. 2 in the same context has been proposed.
Examples have been given. These topics have also been discussed and observed for the monotonically
transformed latent variable.

It is expected that these two modification formulae of the test information function can effectively
be used in order to supplement a relative weakness of the test information function in certain situations.
Results are yet to come.

IV  Reliability Coefficient and Standard Error of Measurement in Classical Mental Test Theory
Predicted in the Context of Latent Trait Models

By virtue of the population-free characteristic of the test information function $I(\theta)$ , adding further
information about the MLE bias function of the test and the ability distribution of the examinee group,
we can provide the tailored reliability coefficient and standard error of measurement in the sense of
classical mental test theory for each and every specified group of examinees who have taken the same
test (cf. Samejima, 1977b, 1987). This is further facilitated by the proposal of the modifications of the
test information function, which use the MLE bias function (cf. Samejima, 1987, 1990), and have been
introduced in the preceding chapter.

Thus  now  we  are  in  the  position  to  predict  the  so-called  reliability  coefficient  and  standard  error 
of  measurement  of  a  test  in  the  sense  of  classical  mental  test  theory,  taking  advantage  of  the  new 
developments  in  latent  trait  models,  which  are  tailored  for  a  specific  population  of  examinees.  It  will  be 
shown  in  this  chapter  how  we  can  do  that. 

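[IV.1]  General Case

Let $\theta_V^*$ be any estimator of ability $\theta$ . We can write

(4.1)  $\theta_V^* = \theta + \varepsilon$ ,

where $\varepsilon$ denotes the error variable. In the test-retest situation, we have

(4.2)  $\theta_{V1}^* = \theta + \varepsilon_1$
       $\theta_{V2}^* = \theta + \varepsilon_2$ ,

where the subscripts, 1 and 2 , indicate the test and retest situations, respectively. If we can reasonably
assume that in the test and retest situations:

(4.3)  $\mathrm{Cov.}(\varepsilon_1, \varepsilon_2) = 0$ ,

(4.4)  $\mathrm{Var.}(\varepsilon_1) = \mathrm{Var.}(\varepsilon_2)$

and

(4.5)  $\mathrm{Cov.}(\theta, \varepsilon_1) = \mathrm{Cov.}(\theta, \varepsilon_2) = 0$ ,

then we will have

(4.6)  $\mathrm{Corr.}(\theta_{V1}^*, \theta_{V2}^*) = [\mathrm{Var.}(\theta_{V1}^*) - \mathrm{Var.}(\varepsilon_1)] [\mathrm{Var.}(\theta_{V1}^*)]^{-1}$ .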

Note that if we replace ability $\theta$ by the true test score $T$ , a transformed form of $\theta$ specific
to a given test, and use the observed test score $X$ as the estimator of $T$ , and $E$ as its error of
estimation, then (4.1) can be rewritten in the form

(4.7)  $X = T + E$ ,

which represents the fundamental assumption in classical mental test theory, and (4.6) becomes a
familiar formula for the reliability coefficient $r_{X_1 X_2}$ ,

(4.8)  $r_{X_1 X_2} = \mathrm{Var.}(T) [\mathrm{Var.}(X)]^{-1}$ .

In classical mental test theory, however, researchers seldom check if these assumptions are acceptable.
In fact, in many cases (4.5) is violated if we replace $\theta$ by $T$ , and $\varepsilon_1$ and
$\varepsilon_2$ by $E_1$ and $E_2$ , respectively, unless the test has been constructed in such a way that
most individuals from the target population have mediocre true scores.

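We can write in general

(4.9)  $\mathrm{Var}(\varepsilon) = E[\varepsilon - E(\varepsilon)]^2$
       $= E[\varepsilon - E(\varepsilon \mid \theta)]^2 + E[E(\varepsilon \mid \theta) - E(\varepsilon)]^2$
       $+ 2E[(\varepsilon - E(\varepsilon \mid \theta))(E(\varepsilon \mid \theta) - E(\varepsilon))]$ .

This indicates that, if the error variable $\varepsilon$ is conditionally unbiased for the interval of $\theta$ of interest,
i.e., if $E(\varepsilon \mid \theta) = 0$ and hence $E(\varepsilon) = 0$, then (4.9) will be reduced to the form

(4.10)  $\mathrm{Var}(\varepsilon) = E[\varepsilon^2]$ .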

[IV.2]  Reliability Coefficient of a Test in the Sense of Classical Mental
Test Theory When the Maximum Likelihood Estimator of $\theta$ Is
Used

Let $\hat{\theta}_V$, or simply $\hat{\theta}$, denote the maximum likelihood estimator of $\theta$ based upon the response pattern
$V$. If: 1) $\hat{\theta}$ is conditionally unbiased for the interval of $\theta$ of interest and 2) the test information
function $I(\theta)$ assumes reasonably high values for that interval, then we will be able to approximate the
conditional distribution of $\hat{\theta}$, given $\theta$, by the normal distribution $N(\theta, [I(\theta)]^{-1/2})$ for the interval
of $\theta$ within which the examinees' ability practically distributes. Thus we have from (4.10)

(4.11)  $\mathrm{Var}(\varepsilon) = E[\{I(\theta)\}^{-1}]$ .

When this is the case, from (4.6) we can write

(4.12)  $\mathrm{Corr}(\hat{\theta}_1, \hat{\theta}_2) = [\mathrm{Var}(\hat{\theta}_1) - E[\{I(\theta)\}^{-1}]]\,[\mathrm{Var}(\hat{\theta}_1)]^{-1}$ .

Thus the reliability coefficient in the sense of classical mental test theory can be predicted by a single
administration of the test, given the test information function $I(\theta)$ and the ability distribution of the
examinees.
The appropriateness of the above normal approximation of the conditional distribution of $\hat{\theta}$, given
$\theta$, can be examined by the Monte Carlo method (cf. Samejima, 1977a). We also notice that a necessary
condition for this approximation is that $\hat{\theta}$ is conditionally unbiased for the interval of $\theta$ of interest.
Thus we can use the MLE bias function, which was introduced in Section 2, as a check on whether the
approximation is supported. Note that the MLE bias function together with the ability distribution of the
target population also determines whether the assumption described by (4.5) should be accepted.

If the conditional unbiasedness is not supported, i.e., if $B(\hat{\theta}_V \mid \theta)$ does not approximately equal
zero for all values of $\theta$ in the interval of interest, however, then we shall be able to adopt one of the
modified test information functions, $\Upsilon(\theta)$ or $\Xi(\theta)$. Thus we can rewrite (4.12) into the forms

(4.13)  $\mathrm{Corr}(\hat{\theta}_1, \hat{\theta}_2) = [\mathrm{Var}(\hat{\theta}_1) - E[\{\Upsilon(\theta)\}^{-1}]]\,[\mathrm{Var}(\hat{\theta}_1)]^{-1}$

and

(4.14)  $\mathrm{Corr}(\hat{\theta}_1, \hat{\theta}_2) = [\mathrm{Var}(\hat{\theta}_1) - E[\{\Xi(\theta)\}^{-1}]]\,[\mathrm{Var}(\hat{\theta}_1)]^{-1}$ .

We can decide which of the modified formulae, (4.13) or (4.14), is more appropriate to use in a specified
situation.

[IV.3]  Standard Error of Measurement of a Test in the Sense of Classical
Mental Test Theory When the Maximum Likelihood Estimator of
$\theta$ Is Used

In  classical  mental  test  theory,  the  standard  error  of  estimation  of  ability  is  represented  by  a  single 
number,  which  is  heavily  affected  by  the  degree  of  heterogeneity  of  the  group  of  examinees  tested, 
as  is  the  case  with  the  reliability  coefficient.  In  contrast,  in  latent  trait  models,  the  standard  error  of 
estimation  is  locally  defined,  i.e.,  as  a  function  of  ability.  It  is  usually  represented  by  the  reciprocal  of  the 
square  root  of  the  test  information  function.  Since  the  test  information  function  does  not  depend  upon 
any  specific  group  of  examinees,  but  is  a  sole  property  of  the  test  itself,  this  locally  defined  standard 
error  is  much  more  appropriate  than  the  standard  error  of  estimation  in  classical  mental  test  theory. 
This function also indicates that no test is efficient in ability measurement for the entire range of ability,
and that each test provides us with large amounts of information only locally, which makes perfect sense.

The standard error of measurement of a test tailored for a specific ability distribution is given by

(4.15)  $\mathrm{S.E.} = E[\{I(\theta)\}^{-1/2}]$

when the conditions 1) and 2) described in the preceding section are met, and by

(4.16)  $\mathrm{S.E.}_1 = E[\{\Upsilon(\theta)\}^{-1/2}]$

or

(4.17)  $\mathrm{S.E.}_2 = E[\{\Xi(\theta)\}^{-1/2}]$

otherwise.
FIGURE 4-1

Density Functions of Six Hypothetical Ability Distributions: n(0.0, 1.0),
n(-0.8, 1.0), n(0.0, 0.5), n(-0.8, 0.5), n(-1.6, 0.5) and n(-2.4, 0.5).
(Horizontal axis: THETA.)


[IV.4]  Examples

For the purpose of illustration, six ability distributions are hypothesized, and for a single test
predictions are made for their tailored reliability coefficients and tailored standard errors of measurement
in the sense of classical mental test theory, using (4.12), (4.13), (4.14), (4.15), (4.16) and (4.17). These six
hypothetical ability distributions are normal distributions, i.e., $N(0.0, 1.0)$, $N(-0.8, 1.0)$, $N(0.0, 0.5)$,
$N(-0.8, 0.5)$, $N(-1.6, 0.5)$ and $N(-2.4, 0.5)$. Figure 4-1 presents the density functions of these six
distributions. The hypothetical test used here is the same one introduced in the preceding chapter,
which consists of thirty equivalent dichotomous items following the logistic model represented by (3.43)
with the common values of parameters, $a_g = 1.0$ and $b_g = 0.0$, respectively, and with the scaling
factor $D$ set equal to 1.7. The MLE bias function and the square roots of the test information
function $I(\theta)$ and of its two modification formulae $\Upsilon(\theta)$ and $\Xi(\theta)$ of this test are shown in Figure
3-5 of the preceding chapter.

Tables 4-1 and 4-2 present the resulting predicted reliability coefficients and standard errors of
measurement for the six different ability distributions, respectively. In each table, the mean and the
variance of $\theta$ of each of the six distributions are also given. We can see that these variances are slightly
different from the squares of the second parameters of the normal distributions, i.e., 0.98322 vs.
1.00000 for Populations 1 and 2, and 0.25155 vs. 0.25000 for Populations 3, 4, 5 and 6,
respectively, whereas all of the means are the same as the first parameters of the normal distributions.
These discrepancies in variance come from the fact that we used frequencies for the equally spaced
points of $\theta$ with the step width 0.05, which are given as integers, in order to approximate the normal
distributions, instead of using the density functions themselves.

As we can see in Table 4-1, the predicted reliability coefficient obtained by (4.12) varies widely,
from 0.200 to 0.896. The coefficient decreases as the main part of the distribution
shifts from a range of $\theta$ where the amount of test information is greater to another range where it is
lesser. The reduction is more conspicuous when the standard deviation of the normal distribution is
smaller. The predicted reliability coefficient obtained by (4.13) using $\Upsilon(\theta)$ instead of $I(\theta)$ indicates
a substantial reduction from the one obtained by (4.12) for each of the six ability distributions. The
reduction is especially conspicuous for Populations 2, 5 and 6, whose ability distributes on lower
levels of $\theta$ where the discrepancies between $I(\theta)$ and $\Upsilon(\theta)$ are large. Among the six populations
the predicted reliability coefficient obtained by means of (4.13) varies from 0.012 to 0.801, showing
an even larger range than that obtained by (4.12). Similar results were obtained for the predicted
reliability coefficient given by (4.14), using $\Xi(\theta)$ instead of $I(\theta)$. The reliability coefficient varies from
0.011 to 0.799, and within each population the reduction in the value of the reliability coefficient from
the one obtained by (4.13) is relatively small, as is expected from the second graph of Figure 3-5.

TABLE 4-1

Three Predicted Reliability Coefficients Tailored for Each of the Six Hypothetical Ability
Distributions, Using the Original Test Information Function and Its Two Modification
Formulae. The Indices, 1, 2 and 3, Represent the Original Test Information Function,
Modification Formula No. 1 and Modification Formula No. 2, Respectively. The
Mean and the Variance of $\theta$ for Each Population Are Also Given.

POPULATION   RELIABILITY 1   RELIABILITY 2   RELIABILITY 3   MEAN OF THETA   VARIANCE OF THETA
    1           0.89641         0.78053         0.76629          0.00000          0.98322
    2           0.82324         0.26479         0.25256         -0.80000          0.98322
    3           0.81738         0.80074         0.79920          0.00000          0.25155
    4           0.73250         0.66611         0.65589         -0.80000          0.25155
    5           0.47715         0.21681         0.20093         -1.60000          0.25155
    6           0.20049         0.01182         0.01109         -2.40000          0.25155

TABLE 4-2

Three Predicted Standard Errors of Measurement Tailored for Each of the Six Hypothetical
Ability Distributions, Using the Original Test Information Function and Its Two
Modification Formulae. The Indices, 1, 2 and 3, Represent the Original Test
Information Function, Modification Formula No. 1 and Modification
Formula No. 2, Respectively. The Mean and the Variance of $\theta$ for
Each Population Are Also Given.

POPULATION   STAND. ERROR 1   STAND. ERROR 2   STAND. ERROR 3   MEAN OF THETA   VARIANCE OF THETA
    1           0.30548          0.37648          0.38514           0.00000          0.98322
    2           0.37887          0.64293          0.66397          -0.80000          0.98322
    3           0.23521          0.24717          0.24811           0.00000          0.25155
    4           0.29172          0.32802          0.33326          -0.80000          0.25155
    5           0.48839          0.73440          0.76583          -1.60000          0.25155
    6           0.91974          2.76394          2.88922          -2.40000          0.25155

TABLE 4-3

Three Theoretical Variances of the Maximum Likelihood Estimates of $\theta$ for Each
of the Six Hypothetical Ability Distributions, Using the Original Test Information
Function and Its Two Modification Formulae. The Indices, 1, 2 and 3, Represent
the Original Test Information Function, Modification Formula No. 1 and
Modification Formula No. 2, Respectively. The Mean and the Variance
of $\theta$ for Each Population Are Also Given.

POPULATION   VARIANCE OF MLE 1   VARIANCE OF MLE 2   VARIANCE OF MLE 3   MEAN OF THETA   VARIANCE OF THETA
    1             1.09684             1.25968             1.28308            0.00000          0.98322
    2             1.19432             3.71324             3.89296           -0.80000          0.98322
    3             0.30775             0.31414             0.31475            0.00000          0.25155
    4             0.34341             0.37763             0.38352           -0.80000          0.25155
    5             0.52718             1.16023             1.25189           -1.60000          0.25155
    6             1.25469            21.28788            22.68190           -2.40000          0.25155

TABLE 4-4

Three Theoretical Error Variances for Each of the Six Hypothetical Ability Distributions,
Using the Original Test Information Function and Its Two Modification Formulae. The
Indices, 1, 2 and 3, Represent the Original Test Information Function, Modification
Formula No. 1 and Modification Formula No. 2, Respectively. The Mean and the
Variance of $\theta$ for Each Population Are Also Given.

POPULATION   VARIANCE OF ERROR 1   VARIANCE OF ERROR 2   VARIANCE OF ERROR 3   MEAN OF THETA   VARIANCE OF THETA
    1              0.11363               0.27646               0.29987             0.00000          0.98322
    2              0.21111               2.73003               2.90974            -0.80000          0.98322
    3              0.05620               0.06260               0.06320             0.00000          0.25155
    4              0.09186               0.12609               0.13197            -0.80000          0.25155
    5              0.27563               0.90868               1.00034            -1.60000          0.25155
    6              1.00314              21.03633              22.43035            -2.40000          0.25155

TABLE 4-5

Reliability Coefficient Computed for Each of the Six Hypothetical Ability Distributions Based
upon the Maximum Likelihood Estimates of the Examinees for Test-Retest Situations Using
a Test of Thirty Equivalent Items Following the Logistic Model with $D = 1.7$, $a_g = 1.0$
and $b_g = 0.0$. The Means and Variances of the Two Sessions and the Covariances Are
Also Presented.

POPULATION   RELIABILITY   MEAN 1     MEAN 2     VARIANCE 1   VARIANCE 2   COVARIANCE
    1          0.90788     -0.00311    0.00106     1.19069      1.16769      1.07051
    2          0.88812     -0.81435   -0.80971     1.07982      1.09703      0.96663
    3          0.80724      0.00785   -0.00754     0.33578      0.33443      0.27051
    4          0.72334     -0.85777   -0.84349     0.40504      0.39310      0.28863
    5          0.55304     -1.68722   -1.67511     0.42299      0.40820      0.22980
    6          0.32187     -2.28115   -2.25897     0.21639      0.23189      0.07210

As for the standard error of measurement, we can see in Table 4-2 that similar results were obtained,
only in reversed order, of course. In classical mental test theory, the standard error of measurement
$\sigma_E$ is given by

(4.18)  $\sigma_E = [\mathrm{Var}(X)]^{1/2}\,[1 - r_{X_1 X_2}]^{1/2}$ ,

where, as before, $r_{X_1 X_2}$ indicates the reliability coefficient. Comparison of Table 4-1 and Table 4-2
reveals that there are substantial discrepancies between the values of $\sigma_E$ obtained by formula (4.18)
using the tailored reliability coefficients in Table 4-1, which are based upon the maximum likelihood
estimate $\hat{\theta}$, in place of $r_{X_1 X_2}$ in (4.18) and the corresponding standard errors of measurement, which
were obtained by formulae (4.15) through (4.17) and presented in Table 4-2. To give some examples,
for Population No. 1 the results of (4.18) are: 0.319, 0.465 and 0.479, respectively; for Population
No. 3 they are: 0.214, 0.224 and 0.225; and for Population No. 6 they are: 0.448, 0.499 and
0.499. These results are understandable, for the degree of violation of the assumptions behind
classical mental test theory is different for the separate ability distributions.
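The values quoted for (4.18) can be recomputed from Table 4-1. The sketch below rests on one assumption that the text does not state: the variance term in (4.18) is taken from the VARIANCE OF THETA column, a choice made here only because it reproduces the quoted values to three decimals. The dictionary and function names are illustrative.

```python
import math

# population: (variance of theta, (reliability 1, 2, 3)), copied from Table 4-1
TABLE_4_1 = {
    1: (0.98322, (0.89641, 0.78053, 0.76629)),
    3: (0.25155, (0.81738, 0.80074, 0.79920)),
    6: (0.25155, (0.20049, 0.01182, 0.01109)),
}

# values of (4.18) quoted in the text for the same three populations
QUOTED = {
    1: (0.319, 0.465, 0.479),
    3: (0.214, 0.224, 0.225),
    6: (0.448, 0.499, 0.499),
}


def classical_sem(variance, reliability):
    """(4.18): sigma_E = [Var]^(1/2) [1 - r]^(1/2)."""
    return math.sqrt(variance) * math.sqrt(1.0 - reliability)


for pop, (var_theta, reliabilities) in TABLE_4_1.items():
    computed = [classical_sem(var_theta, r) for r in reliabilities]
    print(pop, [f"{v:.3f}" for v in computed], "quoted:", QUOTED[pop])
```

Setting these values beside the corresponding rows of Table 4-2 (for example 0.30548, 0.37648 and 0.38514 for Population No. 1) shows the discrepancies discussed above.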

The three theoretical variances of the maximum likelihood estimate of $\theta$ and the three theoretical
error variances are presented in Tables 4-3 and 4-4, respectively, for each of the six hypothetical
populations. The latter were obtained by (4.11) and by replacing $I(\theta)$ in (4.11) by $\Upsilon(\theta)$ and $\Xi(\theta)$,
respectively, and the former are the sum of these separate error variances and the variance of $\theta$.
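The relationships among Tables 4-1, 4-3 and 4-4 can be verified directly from the tabled values. This is a minimal sketch; the numbers are copied from the tables, and the variable and function names are illustrative.

```python
# population -> values for indices 1, 2, 3 (copied from Tables 4-1, 4-3 and 4-4)
VAR_THETA = {1: 0.98322, 2: 0.98322, 3: 0.25155, 4: 0.25155, 5: 0.25155, 6: 0.25155}
VAR_MLE = {
    1: (1.09684, 1.25968, 1.28308), 2: (1.19432, 3.71324, 3.89296),
    3: (0.30775, 0.31414, 0.31475), 4: (0.34341, 0.37763, 0.38352),
    5: (0.52718, 1.16023, 1.25189), 6: (1.25469, 21.28788, 22.68190),
}
VAR_ERROR = {
    1: (0.11363, 0.27646, 0.29987), 2: (0.21111, 2.73003, 2.90974),
    3: (0.05620, 0.06260, 0.06320), 4: (0.09186, 0.12609, 0.13197),
    5: (0.27563, 0.90868, 1.00034), 6: (1.00314, 21.03633, 22.43035),
}
RELIABILITY = {
    1: (0.89641, 0.78053, 0.76629), 2: (0.82324, 0.26479, 0.25256),
    3: (0.81738, 0.80074, 0.79920), 4: (0.73250, 0.66611, 0.65589),
    5: (0.47715, 0.21681, 0.20093), 6: (0.20049, 0.01182, 0.01109),
}


def max_discrepancies():
    """Largest deviations from Var(MLE) = Var(theta) + Var(error) and from
    reliability = [Var(MLE) - Var(error)] / Var(MLE) over all table cells."""
    worst_sum = worst_rel = 0.0
    for pop in VAR_THETA:
        for k in range(3):
            total = VAR_THETA[pop] + VAR_ERROR[pop][k]
            worst_sum = max(worst_sum, abs(total - VAR_MLE[pop][k]))
            predicted = (VAR_MLE[pop][k] - VAR_ERROR[pop][k]) / VAR_MLE[pop][k]
            worst_rel = max(worst_rel, abs(predicted - RELIABILITY[pop][k]))
    return worst_sum, worst_rel


worst_sum, worst_rel = max_discrepancies()
print(f"largest sum discrepancy {worst_sum:.1e}, largest reliability discrepancy {worst_rel:.1e}")
```

Both discrepancies stay at the level of rounding in the fifth decimal place, so the three tables are mutually consistent with (4.11) through (4.14).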

In order to satisfy our curiosity, a simulation study has been made in such a way that, following
each of the six ability distributions, a group of examinees is hypothesized, and, using the Monte Carlo
method, a response pattern of each hypothetical subject is produced for each of the test and retest
situations. Since our test consists of thirty equivalent dichotomous test items, the simple test score is a
sufficient statistic for the response pattern, and the maximum likelihood estimate of $\theta$ can be obtained
upon this sufficient statistic. The numbers of hypothetical subjects are 1,998 for Populations No. 1
and No. 2, and 2,004 for Populations No. 3, No. 4, No. 5 and No. 6. The correlation coefficient
between the two sets of $\hat{\theta}$'s was computed, and the results are presented in Table 4-5.

Comparison of each of these results with the corresponding three tailored reliability coefficients in
Table 4-1 gives the impression that, overall, these correlation coefficients are higher than the predicted
tailored reliability coefficients. This enhancement comes from the fact that in each distribution there are
a certain number of subjects who obtained negative or positive infinity as $\hat{\theta}$, and we have replaced these
negative and positive infinities by more or less arbitrary values, $-2.65$ and $2.65$, respectively, in computing
the correlation coefficients. Since in Population No. 3 none of the 2,004 hypothetical subjects got negative
or positive infinity for their maximum likelihood estimates of $\theta$ in the first session, and only three got
negative infinity and none got positive infinity in the second session, this result, 0.807, will be the
most trustworthy value. We can see that this value, 0.807, is less than 0.817 obtained by using the
original test information function $I(\theta)$, and a little greater than 0.801 obtained upon the Modification
Formula No. 1, $\Upsilon(\theta)$. The next most trustworthy value may be 0.723 of Population No. 4, for which
none of the 2,004 subjects obtained positive infinity as their $\hat{\theta}$'s in each of the two sessions, and 56
and 45 got negative infinity in the first and second sessions, respectively. This value of the correlation
coefficient, 0.723, is a little less than the predicted reliability coefficient 0.733 obtained upon $I(\theta)$,
but somewhat greater than 0.666, which is based upon $\Upsilon(\theta)$, the Modification Formula No. 1; the
artificial enhancement is already visible.

The numbers of subjects who obtained negative and positive
infinities in the first session and in the second session are: 56, 47, 43 and 49 for Population No.
1; 197, 4, 195 and 6 for Population No. 2; 437, 0, 399 and 0 for Population No. 5; and
1,143, 0, 1,118 and 0 for Population No. 6. We must say that, for these four distributions, the
values of the correlation coefficients in Table 4-5 should not be taken too seriously, for these values are
enhanced because of the involvement of too many substitute values for negative and positive infinities.
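A simulation of this kind can be sketched as follows. This is a minimal sketch, not the program used in the report: it assumes the usual logistic item characteristic function with the parameter values quoted in the text, uses the fact that for equivalent items the likelihood equation reduces to $P(\hat{\theta}) = x/n$ for number-correct score $x$, and substitutes $\pm 2.65$ for infinite estimates as described above. The random number generator, the seed and the function names are illustrative.

```python
import math
import random

D, A, B, N_ITEMS = 1.7, 1.0, 0.0, 30   # test described in the text
SUBSTITUTE = 2.65                      # stands in for infinite estimates, as in the text


def mle_from_score(score):
    """MLE of theta from the number-correct score of N_ITEMS equivalent logistic items."""
    if score == 0:
        return -SUBSTITUTE             # the estimate would be negative infinity
    if score == N_ITEMS:
        return SUBSTITUTE              # the estimate would be positive infinity
    return B + math.log(score / (N_ITEMS - score)) / (D * A)


def pearson(x, y):
    """Product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def retest_correlation(mean, sd, n_subjects=2004, seed=1):
    """Correlation between test and retest MLEs for a normal ability distribution."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_subjects):
        theta = rng.gauss(mean, sd)
        p = 1.0 / (1.0 + math.exp(-D * A * (theta - B)))
        for session in (first, second):
            # number of correct answers among N_ITEMS independent items
            score = sum(rng.random() < p for _ in range(N_ITEMS))
            session.append(mle_from_score(score))
    return pearson(first, second)


print(round(retest_correlation(0.0, 0.5), 3))   # setting of Population No. 3
```

Different seeds give somewhat different correlations, so the result should be compared with Table 4-5 only up to sampling error.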

[IV.5]  Discussion and Conclusions

The test information function $I(\theta)$ and its two modification formulae, $\Upsilon(\theta)$ and $\Xi(\theta)$, have been
used to predict the reliability coefficient and the standard error of measurement which are tailored for
each specific ability distribution. Examples of the prediction have been given and a simulation study
has been conducted and shown for comparison. These examples using equivalent test items have been
rather intentionally chosen to make the differences among the separate ability distributions, and those
among the three predicted indices for each ability distribution, clearly visible.

Since  we  have  more  useful  and  informative  measures  like  the  test  information  function  and  its  two 
modified  formulae,  the  reliability  coefficient  of  a  test  is  no  longer  necessary  in  modern  mental  test 
theory.  And  yet  it  is  interesting  to  know  how  to  predict  the  coefficient  using  these  functions,  which  are 
tailored  for  each  separate  population  of  examinees.  In  this  process,  it  will  become  more  obvious  that 
the  traditional  concept  of  test  reliability  is  misleading,  for  without  changing  the  test  the  coefficient  can 
be  drastically  different  if  we  change  the  population  of  examinees. 
V  Validity Measures in the Context of Latent Trait Models

From the scientific point of view, we need to confirm whether a given test indeed measures what it is
supposed to measure, even if we have chosen our items carefully enough in regard to their contents, and
even if we are equipped with highly sophisticated mathematics.

By  virtue  of  the  population-free  nature  of  latent  trait  theory,  we  should  be  able  to  find  some  indices 
of  item  validity,  and  of  test  validity,  which  are  not  affected  by  the  group  of  examinees.  The  resulting 
indices  should  not  be  incidental  as  those  in  classical  mental  test  theory  are,  but  truly  be  attributes  of 
the  item  and  the  test  themselves.  Thus  an  attempt  has  been  made  in  the  present  research  to  obtain 
such  population-free  measures  of  item  validity  and  of  test  validity,  which  are  basically  locally  defined. 

[V.1]  Performance Function: Regression of the External Criterion Variable
on the Latent Variable

It  is  assumed  that  there  exists  an  external  criterion  variable,  which  can  be  measured  directly  or 
indirectly.  This  is  the  situation  which  is  also  assumed  when  we  deal  with  criterion-oriented  validity  or 
predictive  validity  in  classical  mental  test  theory. 

Let $\gamma$ denote the criterion variable, representing the performance in a specific job, etc. We shall
consider the conditional density of the criterion performance, given ability, and denote it by $\xi(\gamma \mid \theta)$.
The performance function, $\zeta(\theta)$, can be defined as the regression of $\gamma$ on $\theta$, or by taking, say, the
75, 90 or 95 percentile point of each conditional distribution of $\gamma$, given $\theta$. Let $p_{\alpha}$ denote the
probability which is large enough to satisfy us as a confidence level. Thus we can write

(5.1)  $p_{\alpha} = \int_{\zeta(\theta)}^{\bar{\gamma}} \xi(\gamma \mid \theta)\, d\gamma$ ,

where $\bar{\gamma}$ denotes the least upper bound of the criterion variable $\gamma$.

Figure 5-1 illustrates the relationships among $\theta$, $\gamma$, $p_{\alpha}$, $\xi(\gamma \mid \theta)$ and $\zeta(\theta)$. It may be reasonable
to assume that the functional relationship between $\theta$ and $\zeta(\theta)$ is relatively simple, not as is illustrated
by the solid line in Figure 5-2, i.e., we do not expect $\zeta(\theta)$ to go up and down frequently within a
relatively short range of $\theta$. We shall assume that $\zeta(\theta)$ is twice differentiable with respect to $\theta$.
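The percentile-based definition of the performance function in (5.1) can be illustrated with a small example. This is a minimal sketch under a purely hypothetical assumption: the conditional distribution of $\gamma$, given $\theta$, is taken to be normal with mean `intercept + slope * theta` and a constant standard deviation, so that $\bar{\gamma}$ is infinite. None of these parameter values or names comes from the report.

```python
from statistics import NormalDist


def performance_function(theta, p_alpha=0.90, slope=2.0, intercept=1.0, sd=1.0):
    """Return zeta(theta) such that P(gamma >= zeta(theta) | theta) = p_alpha,
    i.e., the lower limit of the integral in (5.1), for a hypothetical normal
    conditional density of gamma given theta."""
    conditional = NormalDist(mu=intercept + slope * theta, sigma=sd)
    return conditional.inv_cdf(1.0 - p_alpha)


for theta in (-1.0, 0.0, 1.0):
    print(theta, round(performance_function(theta), 3))
```

Under this assumption $\zeta(\theta)$ is linear and strictly increasing in $\theta$, which corresponds to the simplest case treated in the following section.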
In dealing with an additional dimension or dimensions in latent space, i.e., the criterion variable or
variables, one of the most difficult issues is to keep the population-free nature which is characteristic of
the latent trait models and which is, among others, the main feature that distinguishes the theory from
classical mental test theory. If we consider the projection of the operating characteristic of a discrete item
response on the criterion dimension, for example, then the resulting operating characteristic as a function
of $\gamma$ has to be incidental, for it has to be affected by the population distribution of $\theta$.

We need to start from the conditional distribution of $\gamma$, given $\theta$, therefore, which can be conceived
of as being intrinsic in the relationship between the two variables, and independent of the population
distribution of $\theta$. We assume that $\zeta(\theta)$ takes on the same value only at a finite or an enumerable
number of points of $\theta$. Let $P^{*}_{k_g}(\zeta)$ be the conditional probability assigned to the discrete response
$k_g$, given $\zeta$. We can write

(5.2)  $P^{*}_{k_g}(\zeta) = \sum_{\zeta(\theta) = \zeta} P_{k_g}(\theta)$ .

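[V.2]  When $\zeta(\theta)$ Is Strictly Increasing in $\theta$: Simplest Case

The simplest case is that $\zeta(\theta)$ is strictly increasing in $\theta$. In this case, $\zeta(\theta)$ has a one-to-one
correspondence with $\theta$, and (5.2) becomes simplified into the form

(5.3)  $P^{*}_{k_g}(\zeta) = P_{k_g}[\theta(\zeta)] = P_{k_g}(\theta)$ .

If, in addition, $\{\partial\theta/\partial\zeta\}$ is finite throughout the entire range of $\theta$, then we obtain

(5.4)  $\frac{\partial}{\partial\zeta} P^{*}_{k_g}(\zeta) = \left[\frac{\partial}{\partial\theta} P_{k_g}(\theta)\right] \frac{\partial\theta}{\partial\zeta}$ .

Let $I^{*}_{k_g}(\zeta)$ be the item response information function defined as a function of $\zeta$. We can write

(5.5)  $I^{*}_{k_g}(\zeta) = -\frac{\partial^2}{\partial\zeta^2} \log P^{*}_{k_g}(\zeta)$
       $= I_{k_g}(\theta) \left(\frac{\partial\theta}{\partial\zeta}\right)^2 - \left[\frac{\partial}{\partial\theta} \log P_{k_g}(\theta)\right] \frac{\partial^2\theta}{\partial\zeta^2}$ .

Let $I^{*}_{g}(\zeta)$ and $I^{*}(\zeta)$ be the amounts of information given by a single item $g$ and by the total
test, respectively, for a fixed value of $\zeta$. Then we have from (2.3), (2.6) and (5.5)

(5.6)  $I^{*}_{g}(\zeta) = E[I^{*}_{k_g}(\zeta) \mid \zeta] = \sum_{k_g} I^{*}_{k_g}(\zeta)\, P^{*}_{k_g}(\zeta) = I_g(\theta) \left(\frac{\partial\theta}{\partial\zeta}\right)^2$

and

(5.7)  $I^{*}(\zeta) = \sum_{g=1}^{n} I^{*}_{g}(\zeta) = I(\theta) \left(\frac{\partial\theta}{\partial\zeta}\right)^2$ .

If we take the square roots of these two information functions defined for $\zeta$, then we obtain

(5.8)  $[I^{*}_{g}(\zeta)]^{1/2} = [I_g(\theta)]^{1/2}\, \frac{\partial\theta}{\partial\zeta}$

and

(5.9)  $[I^{*}(\zeta)]^{1/2} = [I(\theta)]^{1/2}\, \frac{\partial\theta}{\partial\zeta}$ .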
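The transformation in (5.6) can be illustrated for a single logistic item. This is a minimal sketch: the item information $D^2 a_g^2 P_g(\theta)[1 - P_g(\theta)]$ is the standard form for the logistic model, while the two linear performance functions, represented here only by their constant derivatives $d\zeta/d\theta$, are hypothetical, as are all names.

```python
import math


def item_information(theta, a=1.0, b=0.0, D=1.7):
    """I_g(theta) of a logistic item: (D a)^2 P (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)


def information_on_criterion(theta, dzeta_dtheta, info=item_information):
    """(5.6): I*_g(zeta) at zeta = zeta(theta) equals I_g(theta) (dtheta/dzeta)^2."""
    dtheta_dzeta = 1.0 / dzeta_dtheta(theta)
    return info(theta) * dtheta_dzeta ** 2


def slow(theta):
    """Derivative of a hypothetical performance function zeta = 0.5 * theta + c."""
    return 0.5


def fast(theta):
    """Derivative of a hypothetical performance function zeta = 2.0 * theta + c."""
    return 2.0


print(information_on_criterion(0.0, slow), information_on_criterion(0.0, fast))
```

The same item yields sixteen times as much information about $\zeta$ under the slowly increasing performance function as under the rapidly increasing one, since the factor $(\partial\theta/\partial\zeta)^2$ is 4 in one case and 1/4 in the other.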

Since a certain constant nature exists for the square root of the item information function while the
same is not true with the original item information function (cf. Samejima, 1979, 1982), $[I^{*}_{g}(\zeta)]^{1/2}$
given by (5.8) instead of the original function given by (5.6) may be more useful on some occasions.
This will be discussed later in this section, when the validity in selection plus classification is discussed.

Suppose that we have a critical value, $\gamma_0$, of the criterion variable, which is needed for succeeding
in a specified job, and that we try to accept applicants whose values of the criterion variable are $\gamma_0$
or greater. If our primary purpose of testing is to make an accurate selection of applicants, then
(5.8) and (5.9) for $\zeta = \gamma_0$, or their squared values shown by (5.6) and (5.7), indicate item and test
validities, respectively. If for some item formula (5.8) or (5.6) assumes a high value at $\zeta = \gamma_0$, then
the standard error of estimation of $\zeta$ around $\zeta = \gamma_0$ becomes small and chances are slim that we
make misclassifications of the applicants by accepting unqualified persons and rejecting qualified ones,
and the reversed relationship holds when (5.8) or (5.6) assumes a low value at $\zeta = \gamma_0$. The same logic
applies to the total test by using formula (5.9) or (5.7) instead of (5.8) or (5.6).

It should be noted in (5.8) or in (5.9) that $[I^{*}_{g}(\gamma_0)]^{1/2}$ or $[I^{*}(\gamma_0)]^{1/2}$ consists of two factors,
i.e., 1) the square root of the item information function $I_g(\theta)$ or that of the test information function
$I(\theta)$ and 2) the partial derivative of ability $\theta$ with respect to $\zeta$ at $\zeta = \gamma_0$. These two factors in
each formula are independent of each other, i.e., one belongs to the item or to the test and the other
to the statistical relationship between $\theta$ and $\gamma$. We also notice that these two factors are in a
supplementary relationship. Thus while it is important to have a large amount of item information, or
of test information, it is even more so to have large values of the derivative, $\{\partial\theta/\partial\zeta\}$, in the vicinity
of $\zeta = \gamma_0$, for this will increase the amount of item information defined with respect to $\zeta$ uniformly
in that vicinity, and also that of test information, as is obvious from the right hand sides of (5.8) and
(5.9). In other words, it is desirable for the purpose of selection for $\zeta$ to increase slowly in $\theta$ in the
vicinity of $\zeta = \gamma_0$.

Since, in general, the same ability $\theta$ has predictabilities for more than one kind of job performance,
or of potential of achievement, the performance function varies for different criterion variables. Note
that neither $[I_g(\theta)]^{1/2}$ nor $[I(\theta)]^{1/2}$ is changed even when the criterion variable is switched. Thus,
for a fixed item or test whose amount of information is reasonably large around $\zeta = \gamma_0$, the derivative
$\{\partial\theta/\partial\zeta\}$ in the vicinity of $\zeta = \gamma_0$ determines the appropriateness of the use of the item or of the test for
the purpose of selection with respect to a specific job, etc. If this derivative assumes a high value, then
an item or a test which provides us with a medium amount of information may be acceptable for our
purpose of selection, while we will need an item or a test whose amount of information is substantially
larger if the derivative is low. Also for the same criterion variable $\gamma$ the derivative $\{\partial\theta/\partial\zeta\}$ varies for
different values of $\gamma_0$, so the appropriateness of an item or of a test depends upon our choice of $\gamma_0$,
too. The above logic also applies for the formulae (5.6) and (5.7), i.e., for the case in which we choose
the information functions, instead of their square roots, changing $\{\partial\theta/\partial\zeta\}$ to its squared value.

It is obvious from (5.6) and (5.8) that we can choose either $I_g(\theta(\gamma_0))$ or $[I_g(\theta(\gamma_0))]^{1/2}$ for use in
item selection, for their rank orders across different items are identical, and they equal the rank orders
of $I^{*}_{g}(\gamma_0)$ as well as those of $[I^{*}_{g}(\gamma_0)]^{1/2}$.

FIGURE 5-3

Some Examples of the Relationship between $\gamma_0$ and the Item Validity Measure
Given by (5.10).

If we take another standpoint that our purpose of testing is not only to make a right selection of
applicants but also to predict the degree of success in the job for each selected individual, then we will
need to integrate $[I^{*}_{g}(\zeta)]^{1/2}$ and $[I^{*}(\zeta)]^{1/2}$, respectively, since we must estimate $\zeta$ accurately not
only around $\zeta = \gamma_0$ but also for $\zeta > \gamma_0$. If we choose $[I^{*}_{g}(\zeta)]^{1/2}$ and $[I^{*}(\zeta)]^{1/2}$ in preference to
their squared values, we will obtain from (5.8) and (5.9)

(5.10)  $\int_{\Omega_{\zeta}} [I^{*}_{g}(\zeta)]^{1/2}\, d\zeta = \int_{\Omega_{\theta}} [I_g(\theta)]^{1/2}\, d\theta$

and

(5.11)  $\int_{\Omega_{\zeta}} [I^{*}(\zeta)]^{1/2}\, d\zeta = \int_{\Omega_{\theta}} [I(\theta)]^{1/2}\, d\theta$ ,

where $\Omega_{\zeta}$ and $\Omega_{\theta}$ indicate the domains of $\zeta$ and $\theta$ for which $\zeta(\theta) > \gamma_0$, respectively. In this
situation we need to select items which assume high values of (5.10) instead of (5.8), or a test which
provides us with a high value of (5.11) in place of (5.9). Note that formulae (5.10) and (5.11) imply that
we can obtain these two validity measures directly from the original item and test information functions,
respectively, i.e., without actually transforming $\theta$ to $\zeta$, as long as we can identify the domain $\Omega_{\theta}$.
This is true for any criterion variable $\gamma$.
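The equality of the two sides of (5.10) can be checked numerically. This is a minimal sketch under hypothetical assumptions: the performance function is taken to be $\zeta(\theta) = \arctan\theta$ (strictly increasing, with least upper bound $\pi/2$), the item is a logistic item with $a_g = 1$ and $b_g = 0$, and a simple midpoint rule is used for both integrals. The critical value 0.5 and all names are illustrative.

```python
import math

D = 1.7


def sqrt_item_information(theta, a=1.0, b=0.0):
    """[I_g(theta)]^(1/2) = D a [P (1 - P)]^(1/2) for a logistic item."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D * a * math.sqrt(p * (1.0 - p))


def midpoint_integral(f, lower, upper, n=20000):
    """Midpoint-rule approximation of the integral of f over (lower, upper)."""
    h = (upper - lower) / n
    return h * sum(f(lower + (i + 0.5) * h) for i in range(n))


GAMMA_BAR = math.pi / 2          # least upper bound of the hypothetical criterion
gamma_0 = 0.5                    # hypothetical critical value
theta_0 = math.tan(gamma_0)      # theta corresponding to gamma_0


def sqrt_information_on_zeta(zeta):
    """(5.8): [I*_g(zeta)]^(1/2) = [I_g(theta)]^(1/2) dtheta/dzeta with theta = tan(zeta)."""
    theta = math.tan(zeta)
    return sqrt_item_information(theta) * (1.0 + theta ** 2)


lhs = midpoint_integral(sqrt_information_on_zeta, gamma_0, GAMMA_BAR)  # left side of (5.10)
rhs = midpoint_integral(sqrt_item_information, theta_0, 40.0)          # right side, tail beyond 40 neglected
print(f"left-hand side {lhs:.5f}, right-hand side {rhs:.5f}")
```

The right-hand side needs only the original item information function and the lower limit $\theta_0$, which is the practical point made in the text.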

Some examples illustrating the values of (5.10) are given in Figure 5-3 for hypothetical items. In the
simplest case observed in this section and illustrated in Figures 5-1 and 5-3, these two domains, $\Omega_{\theta}$
and $\Omega_{\zeta}$, are provided by the two intervals, $(\theta_0, \infty)$ and $(\gamma_0, \bar{\gamma})$, where

(5.12)  $\theta_0 = \theta(\gamma_0)$

and $\bar{\gamma}$ denotes the least upper bound of $\gamma$.

FIGURE 5-4

Relationship between $\gamma_0$ and Item Validity Indicated by (5.10) for Three Hypothetical
Dichotomous Items Whose Operating Characteristics for the Correct Answer Are
Strictly Increasing with Zero and Unity as Their Asymptotes.

It should be noted that the above pair of validity measures depends upon our choice of the critical
value $\gamma_0$. If this value is low, i.e., a specified job does not require high levels of competence with
respect to the criterion variable $\gamma$, then these validity indices assume high values, and vice versa.
It has been pointed out (Samejima, 1979, 1982) that there is a certain constancy in the amount of
information provided by a single test item. To give an example, if an item is dichotomously scored and
has a strictly increasing operating characteristic for success with zero and unity as its two asymptotes,
then the area under the curve for $[I_g(\theta)]^{1/2}$ equals $\pi$, regardless of the mathematical form of the
operating characteristic and its parameter values. We can see, therefore, that if our items belong to this
type then the functional relationship between $\gamma_0$ and the item validity measure given by (5.10) will
be monotone decreasing, with $\pi$ and zero as its two asymptotes, for each and every item. Figure 5-4
illustrates this relationship for three hypothetical items of this type. As we can see in this figure, the
appropriateness of an item changes with $\gamma_0$ both in an absolute sense and relative to the other items,
so that the rank orders of desirability among the items depend upon our choice of $\gamma_0$.
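
The constancy and the dependence of the rank orders on the critical value can be illustrated with a short sketch. It is not taken from the original report: the three logistic items and their parameters are hypothetical, the measure is expressed as a function of $\theta_0 = \theta(\gamma_0)$ rather than of $\gamma_0$ itself, and the closed form used in `validity_5_10` follows from substituting $P$ for $\theta$ in the integral of $[I_g(\theta)]^{1/2}$ for a logistic item.

```python
import math

def logistic_p(theta, a, b):
    """Operating characteristic of the correct answer (hypothetical logistic item)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def sqrt_information(theta, a, b):
    """[I_g(theta)]^(1/2) = a * sqrt(P * (1 - P)) for the logistic item."""
    p = logistic_p(theta, a, b)
    return a * math.sqrt(p * (1.0 - p))

def area(lower, upper, a, b, steps=100_000):
    """Midpoint-rule area under [I_g(theta)]^(1/2) between two ability levels."""
    width = (upper - lower) / steps
    return width * sum(
        sqrt_information(lower + (i + 0.5) * width, a, b) for i in range(steps)
    )

def validity_5_10(theta_0, a, b):
    """Measure (5.10) on (theta_0, infinity): pi - 2 * asin(sqrt(P(theta_0)))."""
    return math.pi - 2.0 * math.asin(math.sqrt(logistic_p(theta_0, a, b)))

# Hypothetical (a, b) pairs; the names are labels only.
items = {"easy": (1.0, -1.0), "medium": (2.0, 0.0), "hard": (1.5, 1.5)}

# Total area over a wide interval: close to pi for every item.
total_areas = {name: area(-40.0, 40.0, a, b) for name, (a, b) in items.items()}

for theta_0 in (0.0, 2.0):
    ranking = sorted(items, key=lambda n: validity_5_10(theta_0, *items[n]), reverse=True)
    print(theta_0, ranking)
```

With these assumed parameters the medium item outranks the easy item at $\theta_0 = 0$ and the order is reversed at $\theta_0 = 2$, while the total area stays at $\pi$ for all three items.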

We  can  see  from  (5.10)  that  this  validity  measure  necessarily  assumes  a  high  value  if  an  item  is 
difficult,  and  the  same  applies  to  (5.11)  for  the  total  test.  This  implies  that  these  validity  measures 
alone  cannot  indicate  the  desirability  of  an  item  and  of  a  test  precisely  for  a  specific  population  of 
examinees.  In  selecting  items  or  a  test,  therefore,  it  is  desirable  to  take  the  ability  distribution  of  the 
examinees  into  account,  if  the  information  concerning  the  ability  distribution  of  a  target  population  is 
more  or  less  available.  In  so  doing  we  shall  be  able  to  avoid  choosing  items  which  are  too  difficult  for 
the target population of examinees. Let $f(\theta)$ denote the density function of the ability distribution
for a specific population of examinees, and $f^*(\zeta)$ be that of $\zeta$ for the same population. Then we can
write

(5.13)  $f^*(\zeta) = f(\theta) \, \frac{d\theta}{d\zeta}$ .

Adopting this as the weight function, from (5.8) and (5.9) we obtain as the validity indices tailored for
a specific population of examinees

(5.14)  $\int_{\Omega_\zeta} [I_g^*(\zeta)]^{1/2} f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} [I_g(\theta)]^{1/2} f(\theta) \, \frac{d\theta}{d\zeta} \, d\theta$

and

(5.15)  $\int_{\Omega_\zeta} [I^*(\zeta)]^{1/2} f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} [I(\theta)]^{1/2} f(\theta) \, \frac{d\theta}{d\zeta} \, d\theta$ .

Thus  by  using  (5.14)  and  (5.15)  instead  of  (5.10)  and  (5.11)  we  shall  be  able  to  make  appropriate  item 
selection  and  test  selection  for  a  target  population  or  sample,  provided  that  the  information  concerning 
its ability distribution is more or less available. Note that, unlike (5.10) and (5.11), formulae (5.14)
and (5.15) imply that these validity measures are also heavily dependent upon the functional formula
of $\zeta(\theta)$.
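
The effect of tailoring can be seen in a small numerical sketch. It is illustrative only: the two logistic items, the standard-deviation-one normal ability density, and the performance function $\zeta(\theta) = e^{\theta}$ (so that $d\theta/d\zeta = e^{-\theta}$) are all assumptions made for the example, and the untailored measure uses the closed form that holds for a logistic item.

```python
import math

def sqrt_information(theta, a, b):
    """[I_g(theta)]^(1/2) for a hypothetical logistic item."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * math.sqrt(p * (1.0 - p))

def normal_pdf(theta, mean, sd=1.0):
    """Assumed ability density f(theta) of the target population."""
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def tailored_validity(theta_0, a, b, pop_mean, theta_max=10.0, steps=50_000):
    """Right-hand side of (5.14) with d(theta)/d(zeta) = exp(-theta)."""
    width = (theta_max - theta_0) / steps
    total = 0.0
    for i in range(steps):
        theta = theta_0 + (i + 0.5) * width
        total += sqrt_information(theta, a, b) * normal_pdf(theta, pop_mean) * math.exp(-theta)
    return total * width

def untailored_validity(theta_0, a, b):
    """Measure (5.10) for a logistic item, in closed form."""
    p0 = 1.0 / (1.0 + math.exp(-a * (theta_0 - b)))
    return math.pi - 2.0 * math.asin(math.sqrt(p0))

theta_0 = 0.0                             # theta(gamma_0) for gamma_0 = 1
moderate, hard = (1.5, -0.5), (1.5, 3.0)  # hypothetical (a, b) pairs
low_ability_mean = -1.0                   # assumed population mean

print(untailored_validity(theta_0, *moderate), untailored_validity(theta_0, *hard))
print(tailored_validity(theta_0, *moderate, low_ability_mean),
      tailored_validity(theta_0, *hard, low_ability_mean))
```

In this example the hard item has the larger value of (5.10), but once the low-ability population is taken into account the moderate item has the larger tailored index.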

If  we  choose  to  use  the  area  under  the  curve  of  the  information  function  instead  of  that  of  its  square 
root,  we  obtain  from  (5.6)  and  (5.7) 

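(5.16)  $\int_{\Omega_\zeta} I_g^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I_g(\theta) \, \frac{d\theta}{d\zeta} \, d\theta$

and

(5.17)  $\int_{\Omega_\zeta} I^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I(\theta) \, \frac{d\theta}{d\zeta} \, d\theta$ ,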

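respectively. We notice that in this case, unlike those of (5.10) and (5.11), the integrands of the right
hand sides of (5.16) and (5.17) are no longer independent of the functional formula of $\zeta(\theta)$. Also, when
information about the ability distribution of a target population of examinees is more or less available,
the tailored item and test validity indices become

(5.18)  $\int_{\Omega_\zeta} I_g^*(\zeta) f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I_g(\theta) f(\theta) \left[ \frac{d\theta}{d\zeta} \right]^2 d\theta$

and

(5.19)  $\int_{\Omega_\zeta} I^*(\zeta) f^*(\zeta) \, d\zeta = \int_{\Omega_\theta} I(\theta) f(\theta) \left[ \frac{d\theta}{d\zeta} \right]^2 d\theta$ ,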

respectively,  if  we  choose  to  use  the  information  functions  instead  of  their  square  roots. 

Note  that,  unlike  the  validity  measures  for  selection  purposes,  in  the  present  situation  the  rank 
orders  of  validity  across  different  items,  or  different  tests,  depend  upon  the  choice  of  the  validity  index. 
Thus  a  question  is:  which  of  the  formulae,  (5.10)  or  (5.16),  and  (5.11)  or  (5.17),  are  better  as  the 
item  and  the  test  validity  indices  for  selection  plus  classification  purposes?  A  similar  question  is  also 
addressed  with  respect  to  (5.14)  and  (5.18),  and  to  (5.15)  and  (5.19).  These  are  tough  questions  to 
answer.  While  the  choice  of  the  square  root  of  the  item  information  function  has  an  advantage  of  a 
certain  constancy  which  has  been  observed  earlier  in  this  subsection,  the  use  of  the  item  information 
has a benefit of additivity, i.e., by virtue of (2.8) the sum total of (5.16) over all the items $g$ equals
(5.17), and the same relationship holds between (5.18) and (5.19). The answers to these questions
have yet to be found.
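
The additivity mentioned above, and its absence for the square-root version, can be written out as follows. This is a sketch under the assumption that (2.8), which is not reproduced in this section, states $I(\theta) = \sum_g I_g(\theta)$.

```latex
\sum_{g} \int_{\Omega_\zeta} I_g^*(\zeta)\, d\zeta
  = \int_{\Omega_\theta} \Big[ \sum_{g} I_g(\theta) \Big] \frac{d\theta}{d\zeta}\, d\theta
  = \int_{\Omega_\theta} I(\theta)\, \frac{d\theta}{d\zeta}\, d\theta ,
\qquad \text{whereas} \qquad
\sum_{g} [I_g(\theta)]^{1/2} \;\ge\; \Big[ \sum_{g} I_g(\theta) \Big]^{1/2} = [I(\theta)]^{1/2} .
```

The first chain is (5.16) summed over items, giving (5.17). The inequality shows why the sum of (5.10) over items generally exceeds (5.11) instead of equaling it.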

When our purpose of testing is strictly the classification of individuals, as in assigning those people
to different training programs, in guidance, etc., (5.10) and (5.11), or (5.16) and (5.17), also serve as the
validity measures of an item and of a test, respectively. In this case, we must set $\gamma_0 = \underline{\gamma}$ in defining the
domains, $\Omega_\zeta$ and $\Omega_\theta$, where $\underline{\gamma}$ is the greatest lower bound of $\gamma$. Thus the two domains, $\Omega_\zeta$ and $\Omega_\theta$,
in these formulae become those of $\zeta$ and $\theta$ for which $\underline{\gamma} < \zeta(\theta) < \bar{\gamma}$. It is obvious that these formulae
provide us with the item and the test validity measures, respectively, for the same reason explained
earlier. The same logic applies for the tailored validity measures provided by (5.14) and (5.15), and
by (5.18) and (5.19), when the information concerning the ability distribution of a target population is
more or less available.

FIGURE 5-5

Example of the Performance Function $\zeta(\theta)$ Which Is Piecewise Monotone in $\theta$.

[V.3]  Test  Validity  Measures  Obtained  from  More  Accurate  Minimum 
Variance  Bounds 

When $d\zeta/d\theta = 0$ at some value of $\theta$, as is illustrated by a dashed line in Figure 5-2, $d\theta/d\zeta$
becomes positive infinity, and so does the item validity measure given by (5.8). This fact provides us
with some doubt, for, while we can see that at such a point of $\zeta$ item validity is high, we must wonder
if positive infinity is an adequate measure. It is also obvious from (2.8) that the same will happen to the
total test if it includes at least one such item. Our question is: should we search for more meaningful
functions than the item and test information functions? This topic will be discussed in this section.

Necessity of the search for a more accurate measure than the test information function becomes
more urgent when the performance function, $\zeta(\theta)$, is not strictly increasing in $\theta$, but is, say, only
piecewise monotone in $\theta$ with finite $d\theta/d\zeta$ and differentiable with respect to $\theta$, as is illustrated
in Figure 5-5. The illustrated performance function is still simple enough, but indicates the trend that
after a certain point of ability the performance level in a specified job decreases. This can happen when
the job does not provide enough challenge for persons of very high ability levels.

Since $I^*(\zeta)$ serves as the reciprocal of the conditional variance of the maximum likelihood estimate of
$\zeta$ only asymptotically and there exist more accurate minimum variance bounds for any (asymptotically)
unbiased estimator (cf. Kendall and Stuart, 1961), we can search for more accurate test validity
measures than the one given by (5.9) by using the reciprocal of the square roots of such minimum
variance bounds.


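Let $J_{rs}(\theta)$ be defined as

(5.20)  $J_{rs}(\theta) = E\left[ \frac{L_V^{(r)}(\theta)}{L_V(\theta)} \, \frac{L_V^{(s)}(\theta)}{L_V(\theta)} \,\Big|\, \theta \right]$ ,  $r, s = 1, 2, \ldots, k$ ,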


where

(5.21)  $L_V^{(s)}(\theta) = \frac{\partial^s}{\partial \theta^s} L_V(\theta)$ .

Let $J(\theta)$ denote the $(k \times k)$ matrix of the element $J_{rs}(\theta)$, and $J^{rs}(\theta)$ be the corresponding element
of its inverse matrix, $J^{-1}(\theta)$. Note that when $k = 1$ we can rewrite (5.20) into the form

(5.22)  $J_{kk}(\theta) = J_{11}(\theta) = E\left[ \left\{ \frac{\partial}{\partial\theta} \log L_V(\theta) \right\}^2 \Big|\, \theta \right] = E\left[ \left\{ \frac{\partial}{\partial\theta} \log P_V(\theta) \right\}^2 \Big|\, \theta \right]$ ,

and from this, (2.7) and (2.8) we can see that $J(\theta)$ is a $(1 \times 1)$ matrix whose element is the test
information function, $I(\theta)$, itself. A set of improved minimum variance bounds is given by

(5.23)  $\sum_{r=1}^{k} \sum_{s=1}^{k} \zeta^{(r)}(\theta) \, J^{rs}(\theta) \, \zeta^{(s)}(\theta)$


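(cf. Kendall and Stuart, 1961), where $\zeta^{(s)}(\theta)$ denotes the s-th partial derivative of $\zeta(\theta)$ with respect
to $\theta$. We obtain, therefore, for a set of new test validity measures

(5.24)  $\left[ \sum_{r=1}^{k} \sum_{s=1}^{k} \zeta_{\gamma_0}^{(r)} \, J^{rs}(\theta(\gamma_0)) \, \zeta_{\gamma_0}^{(s)} \right]^{-1/2}$ ,

where $\zeta_{\gamma_0}^{(s)}$ indicates the s-th partial derivative of $\zeta$ with respect to $\theta$ at $\zeta = \gamma_0$.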

The use of this new test validity measure will ameliorate the problems caused by $d\zeta/d\theta = 0$, if
we  choose  an  appropriate  k  .  The  resulting  algorithm  will  become  much  more  complicated,  however, 
and  we  must  expect  a  substantially  larger  amount  of  CPU  time  for  computing  these  measures  when  k 
is  greater  than  unity.  Note  that  (5.24)  equals  (5.9)  when  k  =  1  . 
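
The reduction for $k = 1$ can be spelled out in one line. This is a sketch which assumes that (5.9), stated earlier in the report and not reproduced here, is the square root of the transformed test information at $\zeta = \gamma_0$, and that $d\zeta/d\theta > 0$ at $\theta_0 = \theta(\gamma_0)$.

```latex
k = 1: \quad J^{11}(\theta_0) = [J_{11}(\theta_0)]^{-1} = [I(\theta_0)]^{-1}
\;\Longrightarrow\;
\Big[ \zeta^{(1)}_{\gamma_0} \, [I(\theta_0)]^{-1} \, \zeta^{(1)}_{\gamma_0} \Big]^{-1/2}
= [I(\theta_0)]^{1/2} \left[ \frac{d\zeta}{d\theta} \right]^{-1}_{\theta = \theta_0}
= [I(\theta_0)]^{1/2} \, \frac{d\theta}{d\zeta} \Big|_{\zeta = \gamma_0} .
```

The last expression is the square root of $I^*(\gamma_0) = I(\theta_0) \, [d\theta/d\zeta]^2$, which is why the new measure adds nothing for $k = 1$ and differs from the earlier one only when higher derivatives enter.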


[V.4]  Multidimensional  Latent  Space 

FIGURE 5-6

Area $\Omega_\theta$ for Different $\gamma_0$'s in Two-Dimensional Latent Space
for a Hypothesized Test.

When our latent space is multidimensional, a generalization of the idea given in Section 5.3 for the
unidimensional latent space can be made straightforwardly. We can write

(5.25)  $\boldsymbol{\theta} = \{\theta_u\}'$ ,  $u = 1, 2, \ldots, \eta$ ,

and the performance function $\zeta(\boldsymbol{\theta})$ becomes a function of $\eta$ independent variables. A minimum
variance bound is given by

(5.26)  $\sum_{u=1}^{\eta} \sum_{v=1}^{\eta} \frac{\partial \zeta(\boldsymbol{\theta})}{\partial \theta_u} \, \frac{\partial \zeta(\boldsymbol{\theta})}{\partial \theta_v} \, I^{uv}(\boldsymbol{\theta})$ ,

where $I^{uv}(\boldsymbol{\theta})$ is the $(u, v)$-th element of the inverse matrix of the $(\eta \times \eta)$ symmetric matrix, whose
element is given by

(5.27)

with $L$ abbreviating $L_V(\boldsymbol{\theta})$, or $P_V(\boldsymbol{\theta})$. The reciprocal of the square root of (5.27) will provide us
with the counterpart of (5.9) for the multidimensional latent space. For $\eta = 2$, the area $\Omega_\theta$ may look
like one of the contours illustrated in Figure 5-6, depending upon our choice of $\gamma_0$, taking the axis for
$\gamma$ vertical to the plane defined by $\theta_1$ and $\theta_2$.
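
For $\eta = 2$ the double sum in the minimum variance bound is a small quadratic form, which the following sketch evaluates. The information matrix and the gradient of the performance function are hypothetical numbers chosen for illustration, and the function names are not from the original report.

```python
def inverse_2x2(matrix):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("information matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def variance_bound(gradient, info_matrix):
    """Double sum of d(zeta)/d(theta_u) * d(zeta)/d(theta_v) * I^{uv}."""
    inverse = inverse_2x2(info_matrix)
    return sum(
        gradient[u] * gradient[v] * inverse[u][v]
        for u in range(2)
        for v in range(2)
    )

# Hypothetical values at one point of the two-dimensional latent space.
info = [[4.0, 1.0], [1.0, 3.0]]   # symmetric information matrix
grad = [0.5, 2.0]                 # partial derivatives of zeta

bound = variance_bound(grad, info)
validity = bound ** -0.5          # reciprocal of the square root of the bound
print(bound, validity)
```

With these numbers the determinant is 11 and the bound is 14.75 / 11. In one dimension the same computation reduces to the squared derivative of the performance function divided by the test information.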


In a more complex situation where both ability and the criterion variables are multidimensional, we
must consider the projection of the item information function on the criterion subspace from the ability
subspace, in order to have the item validity function for each item, and then the test validity function.
It is anticipated that we must deal with a higher mathematical complexity in such a case. The situation
will substantially be simplified, however, if the total set of items consists of several subsets of items,
each of which measures, exclusively, a single ability dimension and a single criterion dimension.


[V.5]  Discussion  and  Conclusions 

Some  considerations  have  been  made  concerning  the  validity  of  a  test  and  that  of  a  single  item. 
Effort  has  been  focused  upon  searching  for  measures  which  are  population-free,  and  which  will  provide 
us  with  local  and  abundant  information  just  as  the  information  functions  do  in  comparison  with  the  test 
reliability  coefficient  in  classical  mental  test  theory.  In  so  doing,  validity  indices  for  different  purposes 
of  testing  and  also  those  which  are  tailored  for  a  specific  population  of  examinees  have  been  considered. 

The above considerations for the item and test validities may be just part of many possible
approaches. We may still have a long way to go before we discover the most useful measures of the item
and test validities. The present research may stimulate other researchers so that they will pursue this
topic further, taking different approaches.

We notice that the test validity measures proposed in this research can be modified by using one
of the two modification formulae, $\Upsilon(\theta)$ and $\Xi(\theta)$, of the test information function (cf. Chapter 3), in
place of the original $I(\theta)$. This will be investigated in the future, when the characteristics of these two
modification formulae have been further investigated and clarified.


VI  Further Investigation of the Nonparametric Approach to
the Estimation of the Operating Characteristics of
Discrete Item Responses

In the present research a method has been proposed which increases the accuracy of estimation of
the operating characteristics of discrete item responses, while retaining the two features described
in Section 2.3, and the new procedure has been tested upon dichotomous items. It has proved to be
effective,  especially  when  the  true  operating  characteristic  is  represented  by  a  steep  curve,  and  also  at 
the  lower  and  upper  ends  of  the  ability  distribution  where  the  estimation  tends  to  be  inaccurate  because 
of  smaller  numbers  of  subjects  involved  in  the  base  data.  Tentatively,  it  is  called  the  Differential  Weight 
Procedure,  and  it  belongs  to  the  Conditional  P.D.F.  Approach  (cf.  Chapter  2).  This  procedure  costs 
more  CPU  time  than  the  Simple  Sum  Procedure,  which  has  been  used  frequently  (cf.  Samejima,  1981, 
1988),  but  the  advantage  of  handling  more  than  one  item,  say,  fifty,  together  in  the  Conditional  P.D.F. 
Approach  is  still  there. 

[VI.1]  Simple Sum Procedure of the Conditional P.D.F. Approach
Combined with the Normal Approach Method

It is obvious from the discussion given in Chapter 2 that the Conditional P.D.F. Approach combined
with the Normal Approach Method is the simplest and one of the most economical procedures in CPU
time. Out of the three procedures of the Conditional P.D.F. Approach the Simple Sum Procedure is the
simplest one (cf. Samejima, 1981). For this reason, the combination of the Simple Sum Procedure of
the Conditional P.D.F. Approach and the Normal Approach Method has most frequently been applied
for simulated and empirical data. Fortunately, in spite of the simplicity of the procedure, the results
with simulated data in the adaptive testing situation and with simulated and empirical data in the
paper-and-pencil testing situation indicate that we can estimate the operating characteristics fairly
accurately by using this combination (cf. Samejima, 1981, 1984). This seems to indicate the robustness of
the Conditional P.D.F. Approach. For one thing, there is a good reason why the Normal Approach Method
works well, for the conditional distribution of $\tau$, given $\hat\tau$, is indeed normal if the (unconditional)
distribution of $\tau$ is normal, and it is a truncated normal distribution if the (unconditional) distribution
of $\tau$ is rectangular, and the truncation is negligible for most of the conditional distributions.

In the Simple Sum Procedure of the Conditional P.D.F. Approach, the operating characteristic,
$P_{k_g}(\theta)$, of the discrete item response $k_g$ of an unknown item $g$ is estimated through the formula

(6.1)  $\hat{P}_{k_g}(\theta) = \hat{P}_{k_g}[\tau(\theta)] = \sum_{s \in k_g} \phi(\tau \mid \hat\tau_s) \left[ \sum_{s=1}^{N} \phi(\tau \mid \hat\tau_s) \right]^{-1}$ ,

where $s$ $(= 1, 2, \ldots, N)$ indicates an individual examinee, and $\phi(\tau \mid \hat\tau_s)$ denotes the conditional density
of $\tau$, given $\hat\tau_s$. This conditional density is estimated by using the estimated conditional moments of
$\tau$, given $\hat\tau_s$, using one of the four methods, as was described in Section 2.3.

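In the Weighted Sum Procedure of the Conditional P.D.F. Approach, we have for the estimated
operating characteristic of $k_g$

(6.2)  $\hat{P}_{k_g}(\theta) = \hat{P}_{k_g}[\tau(\theta)] = \sum_{s \in k_g} w(\hat\tau_s) \, \phi(\tau \mid \hat\tau_s) \left[ \sum_{s=1}^{N} w(\hat\tau_s) \, \phi(\tau \mid \hat\tau_s) \right]^{-1}$ ,

where $w(\hat\tau_s)$ is the weight function of $\hat\tau_s$. When we combine one of these two approaches with
the Normal Approach Method, $\phi(\tau \mid \hat\tau_s)$ in (6.1) or in (6.2) is approximated by the normal density
function, using the first two estimated conditional moments of $\tau$, given $\hat\tau_s$, which are given by (2.13)
and (2.14), respectively, as its parameters, $\mu_{\tau \mid \hat\tau_s}$ and $\sigma_{\tau \mid \hat\tau_s}$, in the formula

(6.3)  $\phi(\tau \mid \hat\tau_s) = [2\pi]^{-1/2} \, [\sigma_{\tau \mid \hat\tau_s}]^{-1} \exp[ -(\tau - \mu_{\tau \mid \hat\tau_s})^2 / \{ 2 \sigma^2_{\tau \mid \hat\tau_s} \} ]$ .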
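
The Simple Sum Procedure combined with the Normal Approach Method can be sketched in a few lines. The sketch below is illustrative and not the author's program: since (2.13) and (2.14) are not reproduced in this section, it assumes as a stand-in that the conditional mean of $\tau$ equals $\hat\tau_s$ and that the conditional standard deviation is a constant `cond_sd`. The five examinees and their responses are invented.

```python
import math

def normal_density(tau, mean, sd):
    """Normal density used for phi(tau | tau_hat_s), as in (6.3)."""
    z = (tau - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def simple_sum_estimate(tau, tau_hats, responses, category, cond_sd=0.5):
    """Estimate of P_kg at tau following (6.1).

    Assumption: conditional mean = tau_hat_s and constant cond_sd, in place
    of the estimated conditional moments (2.13) and (2.14).
    """
    numerator = 0.0      # sum over examinees whose response is `category`
    denominator = 0.0    # sum over all N examinees
    for tau_hat, response in zip(tau_hats, responses):
        density = normal_density(tau, tau_hat, cond_sd)
        denominator += density
        if response == category:
            numerator += density
    return numerator / denominator

# Invented data: maximum likelihood estimates and dichotomous responses.
tau_hats = [-2.0, -1.0, 0.0, 1.0, 2.0]
responses = [0, 0, 1, 1, 1]

p_low = simple_sum_estimate(-3.0, tau_hats, responses, category=1)
p_high = simple_sum_estimate(3.0, tau_hats, responses, category=1)
print(p_low, p_high)
```

The Weighted Sum Procedure of (6.2) would differ only in multiplying each density by a weight that depends on $\hat\tau_s$ before it enters the two sums.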

[VI.2]  Differential Weight Procedure

If we accept the approximation of the conditional distribution of $\hat\tau$, given $\tau$, by the asymptotic
normality, as we do in these approaches (cf. Samejima, 1981), the other conditional distribution, i.e.,
that of $\tau$, given $\hat\tau$, will become more or less incidental. Thus in the Bivariate P.D.F. Approach
the bivariate distribution of $\tau$ and $\hat\tau$ is approximated for each separate item score subpopulation of
subjects of each unknown test item. In the Conditional P.D.F. Approach, however, the incidentality
of this second conditional distribution is not rigorously considered, and the implicit assumption exists
such that for the fixed value of $\hat\tau$ the conditional distributions of $\tau$ are similar for the different item
score subpopulations.

Take the dichotomous response level, for example. On this level, each item is scored “right” or
“wrong”,  “affirmative”  or  “negative”,  etc.  The  above  assumption  of  non-incidentality  may  be  acceptable 
when  the  operating  characteristic  of  the  correct  answer  of  the  item  is  represented  by  a  mildly  steep  curve, 
as  is  the  case  with  most  practical  situations,  and  the  questions  are  asked  to  subjects  whose  ability  levels 
are  compatible  with  the  difficulty  levels  of  the  questions,  as  is  the  case  with  adaptive  testing  and,  though 
less  rigorously,  with  many  cases  of  paper-and-pencil  testing. 

This  assumption  is  not  acceptable,  however,  when  the  operating  characteristic  of  the  correct  answer 
is  represented  by  a  steep  curve.  If  the  operating  characteristic  follows  the  Guttman  scale,  for  example, 
then the conditional distributions of $\tau$, given $\hat\tau$, for the two separate item score subpopulations
are  distinctly  separated,  and  they  do  not  even  overlap!  If  we  use  the  Simple  Sum  Procedure  or  the 
Weighted  Sum  Procedure  for  an  item  which  nearly  follows  the  Guttman  scale,  therefore,  the  resulting 
estimated  operating  characteristics  of  the  correct  and  the  incorrect  answers  will  tend  to  be  flatter  than 
they  actually  are. 

This problem can be solved by estimating differential conditional distributions of $\tau$, given $\hat\tau$, for
the separate discrete item responses to an “unknown” item. Let $\phi_{k_g}(\tau \mid \hat\tau)$ denote the conditional
density of $\tau$, given $\hat\tau$, for the subpopulation of subjects who share the same discrete item response
$k_g$ to an “unknown” item $g$. We can write

(6.4)  $\phi_{k_g}(\tau \mid \hat\tau) = f_{k_g}^*(\tau) \, \psi(\hat\tau \mid \tau) \, [g_{k_g}^*(\hat\tau)]^{-1}$ ,

where $f_{k_g}^*(\tau)$ indicates the density of $\tau$ for the subpopulation of subjects who share $k_g$ as their
common item score of item $g$, $\psi(\hat\tau \mid \tau)$ is the conditional density of $\hat\tau$, given $\tau$, which is approximated
by the normal density, $n[\tau, C^{-1}]$, and $g_{k_g}^*(\hat\tau)$ is the marginal density of $\hat\tau$ for this subpopulation,
for which we have

(6.5)  $g_{k_g}^*(\hat\tau) = \int_{-\infty}^{\infty} f_{k_g}^*(\tau) \, \psi(\hat\tau \mid \tau) \, d\tau$ .

We notice that there is a relationship

(6.6)  $f_{k_g}^*(\tau) = f^*(\tau) \, P_{k_g}^*(\tau) \left[ \int_{-\infty}^{\infty} f^*(\tau) \, P_{k_g}^*(\tau) \, d\tau \right]^{-1}$ ,

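where $f^*(\tau)$ denotes the density of $\tau$ for the total population. Since we have

(6.7)  $\phi(\tau \mid \hat\tau) = f^*(\tau) \, \psi(\hat\tau \mid \tau) \, [g^*(\hat\tau)]^{-1}$ ,

where $g^*(\hat\tau)$ is the density of $\hat\tau$ for the total population of subjects which is given by

(6.8)  $g^*(\hat\tau) = \int_{-\infty}^{\infty} f^*(\tau) \, \psi(\hat\tau \mid \tau) \, d\tau$ ,

from the above formulae we obtain

(6.9)  $\phi_{k_g}(\tau \mid \hat\tau) = \phi(\tau \mid \hat\tau) \, P_{k_g}^*(\tau) \, h(\hat\tau)$ ,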
where $h(\hat\tau)$ is a function of $\hat\tau$ and constant for a fixed value of $\hat\tau$. Thus $\phi_{k_g}(\tau \mid \hat\tau)$ is a density
function proportional to $\phi(\tau \mid \hat\tau) \, P_{k_g}^*(\tau)$. We notice that $\phi(\tau \mid \hat\tau)$ in this formula is common to all the
item scores and across different unknown items, while $P_{k_g}^*(\tau)$ is a specific function of $\tau$ for each $k_g$.
Since $\phi(\tau \mid \hat\tau)$ can be estimated by one of the four methods described in Section 2.3, our effort should
be focused on finding an appropriate differential weight function for each $k_g$. Let $W_{k_g}(\tau)$ denote such
a differential weight function, which replaces $P_{k_g}^*(\tau) \, h(\hat\tau)$ in (6.9). Thus we can revise (6.1) and (6.2)
into the forms

(6.10)  $\hat{P}_{k_g}(\theta) = \hat{P}_{k_g}[\tau(\theta)] = \sum_{s \in k_g} W_{k_g}(\tau) \, \phi(\tau \mid \hat\tau_s) \left[ \sum_{s=1}^{N} W_{k_g}(\tau; s) \, \phi(\tau \mid \hat\tau_s) \right]^{-1}$

and

(6.11)  $\hat{P}_{k_g}(\theta) = \hat{P}_{k_g}[\tau(\theta)] = \sum_{s \in k_g} w(\hat\tau_s) \, W_{k_g}(\tau) \, \phi(\tau \mid \hat\tau_s) \left[ \sum_{s=1}^{N} w(\hat\tau_s) \, W_{k_g}(\tau; s) \, \phi(\tau \mid \hat\tau_s) \right]^{-1}$ .

Since the differential weight function $W_{k_g}(\tau)$ involves $P_{k_g}^*(\tau)$, which itself is the target of estimation,
we may use its estimate, $\hat{P}_{k_g}^*(\tau)$, obtained by the Simple Sum Procedure or by the Weighted Sum
Procedure, as its substitute. In so doing, we may need some local smoothings of $\hat{P}_{k_g}^*(\tau)$ where the
estimation involves substantial amounts of error because of locally small numbers of subjects in the base
data, etc. In some cases we may need several iterations by renewing the differential weight functions
on each stage until the resulting estimated operating characteristic converges.
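
One stage of this idea can be sketched as follows. The code is an illustration under explicit assumptions and should not be read as the author's procedure: the conditional density of $\tau$ is taken to be normal with mean $\hat\tau_s$ and a constant standard deviation, the differential weight for each examinee is the previous-stage estimate for that examinee's own response, renormalized over a grid so that it integrates to one (the role played by $h(\hat\tau)$ in (6.9)), and no local smoothing is applied. The data are an invented item that follows the Guttman scale.

```python
import math

def normal_density(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def simple_sum(grid, tau_hats, responses, cond_sd):
    """Initial estimate of P(correct | tau) on the grid, as in (6.1)."""
    estimate = []
    for tau in grid:
        dens = [normal_density(tau, t, cond_sd) for t in tau_hats]
        correct = sum(d for d, r in zip(dens, responses) if r == 1)
        estimate.append(correct / sum(dens))
    return estimate

def differential_weight_step(grid, tau_hats, responses, p_correct, cond_sd):
    """One stage: each examinee's density is weighted by the previous estimate
    for that examinee's own response and renormalized (assumed reading)."""
    step = grid[1] - grid[0]
    numerator = [0.0] * len(grid)
    denominator = [0.0] * len(grid)
    for tau_hat, response in zip(tau_hats, responses):
        weighted = [
            (p if response == 1 else 1.0 - p) * normal_density(tau, tau_hat, cond_sd)
            for p, tau in zip(p_correct, grid)
        ]
        norm = sum(weighted) * step   # numerical integral over the grid
        if norm <= 0.0:
            continue
        for i, value in enumerate(weighted):
            denominator[i] += value / norm
            if response == 1:
                numerator[i] += value / norm
    return [n / d if d > 0.0 else p
            for n, d, p in zip(numerator, denominator, p_correct)]

grid = [-4.0 + 0.05 * i for i in range(161)]          # tau from -4 to 4
tau_hats = [-2.9 + 0.2 * i for i in range(30)]        # invented estimates
responses = [1 if t > 0.0 else 0 for t in tau_hats]   # Guttman-type item

initial = simple_sum(grid, tau_hats, responses, cond_sd=0.5)
revised = differential_weight_step(grid, tau_hats, responses, initial, cond_sd=0.5)
print(initial[90], revised[90])   # both evaluated at tau = 0.5
```

In this example the revised curve lies above the initial one at $\tau = 0.5$ and below it at $\tau = -0.5$, i.e., it is steeper around the step. Calling `differential_weight_step` repeatedly with the latest estimate corresponds to the iterations mentioned above, for which a smoothing and stopping rule would still have to be chosen.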

[VI.3]  Examples

We have tried this proposed method on the simulated data provided by Dr. Charles Davis of the
Office of Naval Research, using the Simple Sum Procedure of the Conditional P.D.F. Approach combined
with the Normal Approach Method with some modifications as the initial estimate of $P_{k_g}^*(\tau)$ in the
differential weight function. These data are simulated on-line item calibration data of the initial itempool
calibration based upon conventional testing, in which 100 dichotomous items are divided into four
subtests of 25 items each, and each subtest has been administered to 6,000 hypothetical examinees,
and those of different rounds based upon adaptive testing, in which each of the 50 new binary items
has been administered to a subgroup of 1,500 hypothetical subjects out of the total of 15,000. The
ability of these hypothetical examinees is distributed unimodally within the interval of $\theta$, (-3.0, 3.0), with slight
negative skewness.

For  the  purpose  of  illustration,  Figure  6-1  presents  the  results  of  the  Differential  Weight  Procedure 
using  the  results  of  the  Simple  Sum  Procedure  of  the  Conditional  P.D.F.  Approach  combined  with  the 
Normal  Approach  Method  with  some  modifications  as  the  initial  estimates,  for  a  couple  of  items  of 
the  initial  itempool.  They  are  dichotomous  items,  and  were  intentionally  selected  from  those  items 
whose  true  operating  characteristics  of  the  correct  answer  are  non-monotonic,  in  order  to  visualize  the 
benefit  of  the  nonparametric  estimation  of  the  operating  characteristic.  In  each  graph,  also  presented 
for  comparison  is  the  best  fitted  operating  characteristic  of  the  correct  answer  following  the  three- 
parameter  logistic  model,  which  has  been  given  by  Dr.  Michael  Levine.  We  can  see  in  these  graphs  that 
the  resulting  estimated  operating  characteristics  are  fairly  close  to  the  true  ones,  and  that  they  reflect 
the  non-monotonicities.  The  reader  is  directed  to  ONR/RR-90-4  (Samejima,  1990)  for  more  examples. 


FIGURE 6-1

Two Examples of the Estimated Operating Characteristic of the Correct Answer
Using the Differential Weight Procedure (Dotted Line), in Comparison with
the True Operating Characteristic (Solid Line) and the Best Fitted
Three-Parameter Logistic Curve (Dashed Line).


[VI.4]  Sensitivities to Irregularities of Weight Functions

As we have proceeded, several factors have been identified and observed which affect the resulting
estimated operating characteristics substantially. They are concerned with the differential weight
function, and can be itemized as: 1) lower end ambiguities, 2) upper end ambiguities, 3) local irregularities
and 4) overall irregularities.

Out of these factors, lower and upper end ambiguities basically come from the fact that we do not
usually have sufficiently large numbers of subjects on the lowest and the highest ends of the interval
of $\theta$ of interest upon which the estimation of the operating characteristics is made. Also the fact that
the test information function $I(\theta)$ is used in the transformation of $\theta$ to $\tau$ which is specified by
(2.9) may have something to do with these ambiguities. It has been observed (Samejima, 1979b) that
in using equivalent items following the Constant Information Model (Samejima, 1979a) the speed of
convergence of the conditional distribution of the maximum likelihood estimate $\hat\theta$, given $\theta$, to the
asymptotic normality with $\theta$ and $[I(\theta)]^{-1/2}$ as its two parameters substantially differs for different
levels of $\theta$, in spite of the fact that the amount of test information is constant for every level of $\theta$.
To be more specific, the convergence is observed to be much slower at those levels which are close to
either end of the interval of $\theta$ for which the amount of test information is non-zero and constant, and
faster at intermediate levels of $\theta$. This situation can be ameliorated if we replace the test information
function $I(\theta)$ in (2.9) by one of its two modified forms (cf. Chapter 3), $\Upsilon(\theta)$ and $\Xi(\theta)$.

By  irregularity  we  mean  non-smoothness,  which  is  exemplified  by  an  unnatural  angle,  etc.  It  has 
been  observed  that  for  most  items  the  resulting  operating  characteristic  is  amazingly  sensitive  to  these 
irregularities  of  the  differential  weight  function.  In  order  to  observe  these  sensitivities,  Figure  6-2 
illustrates  how  these  irregularities,  which  are  involved  in  the  differential  weight  function,  affect  the 
resulting  estimated  operating  characteristic.  For  more  examples,  the  reader  is  directed  to  ONR/RR- 
90-4  (Samejima,  1990). 

The  effect  of  local  irregularities  is  most  interesting  to  observe  in  the  three  examples  presented  by 
Figure  6-2.  In  each  of  these  graphs,  the  artificially  irregular  differential  weight  function  for  the  correct 
answer  is  drawn  by  a  short  dashed  line,  and,  in  order  to  emphasize  its  irregularities,  it  was  proportionally 
enlarged  and  shown  by  a  long  dashed  line.  We  can  see  in  each  graph  that,  when  the  differential  weight 
function  has  an  unnatural  angle,  for  example,  the  resulting  estimated  operating  characteristic  of  the 
correct answer also shows an unnatural angle at approximately the same level of $\theta$. We can also see in
these  graphs  how  overall  irregularities  of  the  differential  weight  function  affect  the  resulting  estimated 
operating  characteristic,  and  how  sensitive  the  latter  is  to  the  former.  This  type  of  sensitivity  of  the 
resulting  estimated  operating  characteristic  to  the  irregularities  of  the  differential  weight  function  is 
encouraging  as  well  as  threatening,  for  it  promises  success  in  the  estimation  provided  that  we  succeed 
in  finding  the  right  differential  weight  function. 

During the present research period, the author and her research assistants have spent perhaps the
greatest amount of time on developing this method, the Differential Weight Procedure of the Conditional
P.D.F. Approach. Thus, in addition to the results exemplified in this section and in ONR/RR-90-4
(Samejima, 1990), many other results have been produced, using different strategies in
specifying differential weight functions, etc. The research will be continued in the future, and those
results which are not introduced in this final report will form part of the basis upon which the future
research will be founded and planned, and will eventually be introduced in future research reports.

FIGURE 6-2

Three Examples of the Estimated Operating Characteristic of the Correct Answer
Using the Differential Weight Procedure (Dotted Line), in Comparison with the
True Operating Characteristic (Solid Line), When the Differential Weight
Function (Short Dashed Line) Has Irregularities. The Function Was Also
Proportionally Enlarged and Plotted (Long Dashed Line) to Visualize
the Angles and Other Irregularities Well.


[VI.5]  Discussion and Conclusions

A new procedure of nonparametric estimation of the operating characteristics of discrete item
responses has been proposed, which is called the Differential Weight Procedure of the Conditional P.D.F.
Approach. Some examples have been given, and sensitivities of the resulting estimated operating
characteristics to irregularities of the differential weight functions have been observed and discussed. These
outcomes suggest the importance of further investigation of the weight function in the future.

To summarize, although the Simple Sum Procedure of the Conditional P.D.F. Approach combined with
the Normal Approach Method works reasonably well for the on-line item calibration of adaptive testing,
and also for the paper-and-pencil testing, especially when the number of subjects is large, if we wish
to increase the accuracy of estimation we can use the Differential Weight Procedure. The disadvantage
will be the added CPU time, so we need to consider the balance of the cost and accuracy of estimation
before we make our decision. The present procedure will, however, require less CPU time than the
Bivariate P.D.F. Approach.

VII  Content-Based Observation of Informative Distractors
and Efficiency of Ability Estimation

Partly because of the availability of computer software, such as Logist (Wingersky, Barton and Lord,
1982), Bilog (Bock and Aitkin, 1981), etc., it is common for researchers to mold the operating
characteristics of correct answers into the three-parameter logistic model, ignoring their possible
non-monotonicity. In some cases, distractors which cause the non-monotonicity are even regarded as
undesirable and are replaced by some other non-threatening alternative answers.

A  question  must  be  raised  as  to  whether  this  strategy  is  wise.  In  this  chapter,  this  issue  will  be 
discussed  both  from  theory  and  from  practice,  and  a  new  strategy  of  writing  test  items,  which  leads 
to  more  efficient  ability  estimation,  will  be  proposed.  It  will  take  advantage  of  the  ease  in  handling 
mathematics  attributed  to  parameterization,  and  yet  minimize  the  effect  of  noise  caused  by  random 
guessing. 

[VII.1]  Non-Monotonicity of the Conditional Probability of the Positive
Response, Given Latent Variable

This  section  deals  basically  with  the  essence  or  a  summary  of  the  paper  published  by  the  author 
more  than  twenty  years  ago  (Samejima,  1968),  as  one  of  the  research  reports  of  the  L.  L.  Thurstone 
Psychometric  Laboratory  of  the  University  of  North  Carolina.  The  content  of  the  paper  was  a  protocol 
which  led  to  the  proposal  of  a  new  family  of  models  for  the  multiple-choice  test  item  (Samejima, 
1979b).  The  author  believes  that  this  paper  published  in  1968  still  gives  new  ideas  to  today’s  research 
communities. 

The  paper  is  concerned  with  the  nominal  response,  and  also  multiple-choice  situations,  in  which 
examinees  are  required  to  choose  one  of  the  given  alternatives,  in  connection  with  the  graded  response 
model  (cf.  Samejima,  1969,  1972).  For  a  multiple-choice  item  a  certain  number  of  false  answers  are  given 
in  addition  to  the  correct  answer.  In  a  general  case  it  is  impossible  to  score  them  in  a  graded  manner  in 
accordance  with  their  degrees  of  attainment  toward  the  goal.  Thus  the  multiple-choice  situation  should 
be  treated  as  a  special  instance  of  the  nominal  level  of  response,  although,  in  addition,  the  problem  of 
random  or  irrational  choice  should  be  investigated. 

Confining  discussions  to  examinees  who  have  responded  to  item  g  incorrectly,  there  can  be  diversity 
of  false  answers  if  they  have  responded  to  it  freely,  without  being  forced  to  choose  one  of  a  set  of 
alternative  answers.  It  is  conceivable  that  some  of  the  false  answers  may  require  high  levels  of  ability 
measured  while  some  others  may  not,  some  may  be  related  to  the  ability  measured  strongly  while  some 
others  may  not,  etc.  An  objective  measure  of  the  plausibility  of  a  specified  false  answer  is  its  operating 
characteristic, i.e., the probability of its occurrence defined for a fixed value of ability $\theta$, and, therefore,
expressed as a function of $\theta$.

Let $M_s(\theta)$ be a sequence of the conditional probabilities corresponding to the cognitive subprocesses
required in finding the plausibility of response $k_g$ to item $g$, and $U_{k_g}(\theta)$ be the conditional probability
that an examinee discovers the irrationality of response $k_g$ as the answer to item $g$, on condition
that he has already found out its plausibility. The operating characteristic of $k_g$, which is denoted by
$P_{k_g}(\theta)$, can be expressed by

(7.1)  $P_{k_g}(\theta) = \left[ \prod_{s \in k_g} M_s(\theta) \right] [1 - U_{k_g}(\theta)]$ ,

since it is reasonably assumed that an examinee who gives a response $k_g$ to item $g$ is one who has
succeeded in finding $k_g$'s plausibility, and yet failed in finding its irrationality. We notice that this
formula is exactly the same in its structure as the definition of $P_{x_g}(\theta)$ on the graded response level,
where $M_s(\theta)$ is replaced by $M_u(\theta)$ and $U_{k_g}(\theta)$ is replaced by $M_{(x_g+1)}(\theta)$ (cf. Samejima, 1972).
Defining $M_{k_g}(\theta)$ such that

(7.2)  $M_{k_g}(\theta) = \prod_{s \in k_g} M_s(\theta)$ ,

we can rewrite (7.1) into

(7.3)  $P_{k_g}(\theta) = M_{k_g}(\theta)\,[1 - U_{k_g}(\theta)]$ .


It will reasonably be assumed from their definitions that both $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ be strictly
increasing in $\theta$, provided that a specified response $k_g$ is a good mistake in the sense that the
discoveries of its plausibility and irrationality are properly related with ability $\theta$. It will also be
reasonably assumed that the upper asymptotes of $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ are unity, and the lower
asymptote of $M_{k_g}(\theta)$ is zero.

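We assume that both $M_{k_g}(\theta)$ and $U_{k_g}(\theta)$ are three-times-differentiable with respect to $\theta$. It is
easily observed that, in order to satisfy the unique maximum condition (Samejima, 1969, 1972), $P_{k_g}(\theta)$
defined by (7.3) must fulfill the following inequalities:

(7.4)  $\frac{\partial^2}{\partial\theta^2} \log M_{k_g}(\theta) = \frac{\partial}{\partial\theta} \left[ \left\{ \frac{\partial}{\partial\theta} M_{k_g}(\theta) \right\} \{M_{k_g}(\theta)\}^{-1} \right] < 0$

and

(7.5)  $\frac{\partial^2}{\partial\theta^2} \log [1 - U_{k_g}(\theta)] = \frac{\partial}{\partial\theta} \left[ - \left\{ \frac{\partial}{\partial\theta} U_{k_g}(\theta) \right\} \{1 - U_{k_g}(\theta)\}^{-1} \right] < 0$ .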

(For proof, see Samejima, 1968.) Note that in this case the lower asymptote of $U_{k_g}(\theta)$ need not be
zero. The operating characteristic of a specified response $k_g$ which satisfies the unique maximum
condition was called the plausibility curve (Samejima, 1968), and later the plausibility function (cf.
Samejima, 1984a). As the condition suggests, the plausibility curve is necessarily unimodal. A
schematized hypothesis for the plausibility curve is the following. The probability that an examinee will
find the plausibility, but will fail in discovering the irrationality, of a specified response $k_g$ as the answer
to item $g$ is a function of ability $\theta$; it increases as ability $\theta$ increases, reaches maximum at a certain
value of $\theta$, and then decreases afterwards. If an item provides many such responses, their plausibility
curves will be powerful sources of information in estimating examinees' abilities. That is to say, we can
make use of specific wrong answers to an item as sources of information, as well as the correct answer.

Let $P_g(\theta)$ denote the operating characteristic of the correct answer of a dichotomous item $g$ in
the free-response situation. Let $P_g^*(\theta)$ be the same function, but in the multiple-choice situation. The
conventional three-parameter model is represented by

(7.6)  $P_g^*(\theta) = c_g + (1 - c_g)\,P_g(\theta)$ ,

where $c_g$ is the probability with which an examinee will guess correctly (Lord and Novick, 1968).
This is a monotonically increasing function of $\theta$ with $c_g$ ($> 0$) and unity as its lower and upper
asymptotes, provided that $P_g(\theta)$ is strictly increasing in $\theta$ with zero and unity as its lower and upper
asymptotes.

The psychological hypothesis which has led to the formula (7.6) in the multiple-choice situation is
the following. If an examinee has ability $\theta$, then the probability that he will know the correct answer
is given by $P_g(\theta)$; if he does not know it, he will guess randomly, and, with probability $c_g$, will guess
correctly (Lord and Novick, 1968). Thus we have for the operating characteristic of the correct answer
of item $g$ in the multiple-choice situation

(7.7)  $P_g(\theta) + [1 - P_g(\theta)]\,c_g$ ,

which  leads  to  (7.6).  This  hypothesis  may  not  necessarily  be  appropriate  for  ability  measurement.  One 
can  never  tell  in  the  measurement  of  a  reasoning  ability,  for  instance,  whether  an  examinee  knows  the 
correct  answer  to  item  g  or  not,  until  he  has  tried  to  solve  it.  He  may  respond  with  an  incorrect 
alternative  without  guessing  at  all.  To  explain  such  a  case  we  need  some  other  hypothesis  than  the  one 
which  leads  to  the  formula  (7.6). 

Hereafter, we assume that $P_g(\theta)$ is strictly increasing in $\theta$ with zero and unity as its lower and
upper asymptotes, and is twice-differentiable with respect to $\theta$. Suppose, further, that both $P_g(\theta)$
and $[1 - P_g(\theta)]$ satisfy the unique maximum condition. In this case $P_g^*(\theta)$ defined by (7.6) does
not satisfy either of Conditions (i) and (ii) for the unique maximum, unless $c_g$ is zero, i.e., the
free-response situation, although they are fulfilled for the negative answer to item $g$ (cf. Samejima, 1973).
Observations and discussion are made (Samejima, 1968) giving two simple cases of the multiple-choice
situation as examples. In those examples, only two items are involved, and the response pattern, (1,0),
is solely treated, and precise mathematical derivations are given.

A possible correction for the conventional functional formula for the operating characteristic of the
correct answer of a multiple-choice item can be made by introducing the probability of random guessing
defined for a fixed value of $\theta$. Let $d_g(\theta)$ denote this probability. A reasonable assumption for this
function may be that it be non-increasing in $\theta$. Thus the probability with which an examinee of ability
$\theta$ will answer item $g$ correctly by following the due cognitive process is expressed by $[1 - d_g(\theta)]P_g(\theta)$;
and the one with which he will give the correct answer by guessing should be $d_g(\theta)c_g$. For economy of
notation, let $P_g^*(\theta)$ be the operating characteristic of the correct answer to item $g$ in the corrected
functional formula also. We can write

(7.8)  $P_g^*(\theta) = [1 - d_g(\theta)]\,P_g(\theta) + d_g(\theta)\,c_g$

       $= P_g(\theta) + d_g(\theta)\,[c_g - P_g(\theta)]$ .


A schematized psychological hypothesis which leads to this formula is as follows. If an examinee
has ability $\theta$, then he will depend upon random guessing in answering item $g$ with probability
$d_g(\theta)$; in that case, the conditional probability with which he will guess correctly is given by $c_g$. If
he does not depend upon random guessing, he will try to solve the item by the due cognitive process,
and will succeed in solving it with probability $P_g(\theta)$. Thus according to this functional formula the
probability with which an examinee will respond with an incorrect alternative without guessing is given
by $[1 - d_g(\theta)][1 - P_g(\theta)]$, which is nil in the model represented by the formula (7.6).

We can conceive of several factors which may affect the functional formula for $d_g(\theta)$. The difficulty
of item $g$ may be one of them; the discriminating power may be another; the number of alternatives
attached to item $g$ may also affect the probability, i.e., it may be that the fewer the number of
alternatives, the more tempted to depend upon random guessing an examinee will be; also the plausibilities
of the alternatives may be counted as a factor.

In a simplified case where $d_g(\theta)$ is constant throughout the whole range of $\theta$, we can rewrite (7.8)
in the following form.

(7.9)  $P_g^*(\theta) = d_g c_g + [1 - d_g]\,P_g(\theta)$ .


This is somewhat similar to formula (7.6), the conventional functional formula for the operating
characteristic of the correct answer of a multiple-choice item. The lower asymptote of the present
function is $d_g c_g$ ($\le c_g$), however, while it is $c_g$ in (7.6); the upper asymptote of the present function is
$[1 - d_g(1 - c_g)]$, which can be less than unity, while it is unity in (7.6). In a special case where $d_g = 0$,
that is, an examinee tries to solve item $g$ by proper reasoning with probability one, (7.9) reduces
to $P_g(\theta)$, the operating characteristic of the correct answer in the free-response situation. In another
special case where $d_g = 1$, that is, an examinee depends upon random guessing with probability one,
(7.9) reduces to a constant, $c_g$. In the more general case where $d_g(\theta)$ varies as $\theta$ varies, it is observed
from (7.8) that

(7.10)  $0 < P_g(\theta) \le P_g^*(\theta) \le c_g$   if  $\theta < \theta_0$

        $P_g^*(\theta) = c_g = P_g(\theta)$   if  $\theta = \theta_0$

        $c_g \le P_g^*(\theta) \le P_g(\theta) < 1$   if  $\theta > \theta_0$

where

(7.11)  $\theta_0 = P_g^{-1}(c_g)$ ,

provided that $c_g$ is greater than zero. This result is quite natural, since it is reasonably assumed that
the probability of success in solving item $g$ will decrease by random guessing if the one attained by
the due cognitive process is higher than the one attained by random guessing, and it will increase by
random guessing if the latter probability is higher than the former. If we assume that the asymptotes
of $d_g(\theta)$ in negative and positive directions be unity and zero, respectively, we will obtain $c_g$ and
unity as the lower and upper asymptotes of $P_g^*(\theta)$. Figure 7-1 presents two examples of the operating
characteristic given by (7.8) where $c_g$ is 0.2, using two different $d_g(\theta)$'s. Note that there is a
dip on the lower part of the curves for $P_g^*(\theta)$. These two $d_g(\theta)$'s are identical for the lower levels of
$\theta$, but differ on the upper levels, with the upper asymptotes 0.0 and 0.1, respectively. In these
examples, therefore, the upper asymptote of $P_g^*(\theta)$ is unity in the first example, and 0.92 in the
second, i.e., the conditional probability for the correct answer never approaches unity however high the
ability may be.
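
The behaviour described for (7.8) can be checked numerically. The Python sketch below is only an illustration: the functional formulae actually used for Figure 7-1 are not given in the text, so `p_free` (a standard normal ogive for $P_g(\theta)$) and `d_guess` (a non-increasing $d_g(\theta)$ falling from unity to a chosen asymptote on the upper levels) are hypothetical choices. Only $c_g = 0.2$ and the two asymptotes 0.0 and 0.1 are taken from the discussion above.

```python
import math

C_G = 0.2  # guessing probability c_g quoted for Figure 7-1


def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_free(theta):
    # Hypothetical free-response operating characteristic P_g(theta).
    return phi(theta)


def d_guess(theta, d_upper):
    # Hypothetical non-increasing d_g(theta): tends to 1 as theta -> -inf
    # and to d_upper as theta -> +inf.
    return d_upper + (1.0 - d_upper) * (1.0 - phi(2.0 * (theta + 1.0)))


def p_star(theta, d_upper):
    # Formula (7.8): P*_g = P_g + d_g (c_g - P_g).
    p = p_free(theta)
    return p + d_guess(theta, d_upper) * (C_G - p)


grid = [i / 10.0 for i in range(-60, 61)]
for d_upper in (0.0, 0.1):
    curve = [p_star(t, d_upper) for t in grid]
    # Lowest point of the curve (the dip) and the value at the top of the grid.
    print(d_upper, round(min(curve), 4), round(curve[-1], 4))
```

With these assumed functions the curve starts at $c_g$, dips below it, and then rises toward $1 - d_g(1 - c_g)$, which is 1 for the first asymptote and 0.92 for the second, in agreement with the values stated above.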

If $d_g(\theta)$ is differentiable, $P_g^*(\theta)$ is also differentiable, and from (7.8) we have

(7.12)  $\frac{\partial}{\partial\theta} P_g^*(\theta) = [1 - d_g(\theta)]\,\frac{\partial}{\partial\theta} P_g(\theta) + [c_g - P_g(\theta)]\,\frac{\partial}{\partial\theta} d_g(\theta)$ .

Thus it is obvious that $P_g^*(\theta)$ is strictly increasing in $\theta$ for the range $\theta > \theta_0$, if, and only if, $d_g(\theta)$
is less than unity for the range of $\theta$ satisfying $\theta > \theta_0$. Thus in this case $P_g^*(\theta)$ is non-decreasing in
$\theta$ throughout its whole range. In general, $P_g^*(\theta)$ equals $c_g$ and presents a horizontal line as far as
$d_g(\theta)$ is unity, and then increases for the rest of the range as $\theta$ increases.

As for the range expressed by $\theta < \theta_0$, $P_g^*(\theta)$ equals $c_g$ regardless of the value of $P_g(\theta)$ for the
values of $\theta$ for which $d_g(\theta)$ is unity, and is some positive value less than $c_g$ otherwise. If $d_g(\theta)$ is
unity throughout this range of $\theta$, $P_g^*(\theta)$ presents a horizontal line for this range. If $d_g(\theta)$ is unity
for the negative extreme value of $\theta$, but $d_g(\theta)$ takes on some values less than unity for a subset of

Relationships among $P_g(\theta)$, $d_g(\theta)$ and $P_g^*(\theta)$ Using Two Different $d_g(\theta)$'s.
FIGURE 7-1


$\theta$ of this range, $P_g^*(\theta)$ has at least one local minimum. If $d_g(\theta)$ is less than unity for the negative
extreme value of $\theta$, $P_g^*(\theta)$ can be strictly increasing in $\theta$, non-decreasing, or have one or more local
minima, in accordance with the functional formula for $d_g(\theta)$.

It is obvious that any operating characteristic having local minima does not satisfy the unique
maximum condition (Samejima, 1969, 1972), and neither does the one whose first derivative equals zero
at some value of $\theta$. In the case of $P_g^*(\theta)$ defined by (7.8) we can prove that, in general, it does not
satisfy the unique maximum condition, even if it is strictly increasing in $\theta$. (For proof, see Samejima,
1968.)

Two  characteristics  of  the  model  represented  by  (7.8)  are  that  it  allows  dips,  and  also  a  smaller  value 
than  unity  for  the  upper  asymptote  of  the  operating  characteristic  of  the  correct  answer,  as  Figure  7-1 
illustrates. In these examples, there is only one dip on the lower level of $\theta$. There can be more than
one,  however,  and  an  example  is  presented  elsewhere  (Samejima,  1968).  In  many  cases  the  model  may 
describe  the  real  operating  characteristic  of  the  correct  answer  more  closely  than  the  three-parameter 
model. 

It has been reported by several researchers that they have come across estimated operating
characteristics of correct answers that do not converge to unity, but to some other values less than unity. Note
that  the  general  model  described  above  can  handle  such  situations,  although  most  of  the  other  models 
proposed  by  different  researchers  so  far  cannot. 

We notice that neither (7.6) nor (7.8) explicitly takes into consideration the influences of separate
distractors. Suppose an examinee A has chosen to solve item g by reasoning, i.e., without guessing, and
has  reached  an  answer  which  is  not  correct.  Suppose,  further,  that  this  specified  response  is  not  given 
as  an  alternative  answer  to  this  item.  Then  either  he  will  decide  to  give  an  answer  by  guessing,  or  he 
will  try  to  solve  the  item  by  reasoning  all  over  again.  To  account  for  these  possibilities,  we  would  have 
to  give  practically  all  the  different  plausible  responses  to  item  g  as  its  alternatives,  which  is  practically 
impossible,  since  the  number  of  alternative  answers  is  more  or  less  restricted.  In  contrast  to  this,  it 
is  interesting  to  note  that  the  psychological  hypothesis  behind  the  three-parameter  logistic  model  may 
be  more  realistic  in  the  case  where  no  very  plausible  responses  except  for  the  correct  answer  to  item 
g  are  given  as  its  alternative  answers.  Thus,  even  if  an  examinee  has  reached  a  specified  plausible 
response  other  than  the  correct  answer,  he  may  turn  to  random  guessing  simply  because  he  cannot  find 
that  specified  answer  among  the  alternatives.  Such  a  situation  has  another  serious  problem,  however, 
since  it  is  likely  for  an  examinee  who  is  highly  alternative-oriented  to  choose  the  correct  answer  without 
much  reasoning  or  guessing,  simply  because  the  other  alternatives  are  too  ridiculous  to  be  the  answer 
to  the  item.  As  the  result,  the  operating  characteristic  of  the  correct  answer  may  be  deformed  so  that 
it  has  a  lower  difficulty  and  less  discriminating  power.  Plausible  answers  as  distractors  are  necessary  as 
alternatives  in  order  not  to  destroy  the  nature  of  the  item. 

It  is  conceivable  that  the  plausibilities  of  the  alternatives  attached  to  item  g  other  than  the  correct 
answer  will  be  one  of  the  factors  affecting  the  probability  of  random  guessing  in  the  multiple-choice 
situation.  For  this  reason,  here  we  shall  suppose  that  an  examinee  will  try  to  solve  the  item  following 
proper  cognitive  processes  at  the  beginning,  and  only  in  the  case  where  he  has  reached  an  answer  which 
is  not  given  as  an  alternative,  or  where  he  has  failed  to  find  any  answer  at  all,  he  will  guess. 

Let $k_g$ or $h_g$ denote a specified response to item $g$ which is given as an alternative, including the
correct answer, and $P_{k_g}(\theta)$ or $P_{h_g}(\theta)$ be its operating characteristic in the free-response situation.
It may reasonably be assumed that $\sum_{k_g} P_{k_g}(\theta)$ is less than or equal to unity for any fixed value of
$\theta$. Let $P_{k_g}^*(\theta)$ or $P_{h_g}^*(\theta)$ denote the operating characteristic of a specified alternative $k_g$ or $h_g$ in
the multiple-choice situation, and $c_{k_g}$ or $c_{h_g}$ be the probability of choosing $k_g$ or $h_g$ by guessing,
which satisfies

(7.13)  $\sum_{k_g} c_{k_g} = 1$ .

Thus we can write

(7.14)  $P_{k_g}^*(\theta) = P_{k_g}(\theta) + \left[ 1 - \sum_{h_g} P_{h_g}(\theta) \right] c_{k_g}$

for any $k_g$, and, by using the notation for the correct answer as we did in the previous sections, we
obtain

(7.15)  $P_g^*(\theta) = P_g(\theta) + \left[ 1 - \sum_{h_g} P_{h_g}(\theta) \right] c_g$ .

It  is  worth  noting  that  we  have  specified  not  only  the  operating  characteristic  of  the  correct  answer 
in  the  multiple-choice  situation,  but  also  of  each  distractor.  The  utility  of  the  operating  characteristic 
of  each  wrong  alternative  answer  in  the  estimation  of  an  examinee’s  ability,  as  well  as  the  one  of  the 
correct  response,  is  suggested,  and  this  is  a  feature  of  the  present  discussion. 

It has been made clear that, in general, $P_g^*(\theta)$ does not satisfy the unique maximum condition
regardless of the functional formulae for the plausibility curves of the distractors. As for the alternatives
other than the correct answer, it can easily be shown that, in general, $P_{k_g}^*(\theta)$ does not satisfy the
unique maximum condition (cf. Samejima, 1968, 1979b).

Operating Characteristic of the Correct Answer in the Free-Response Situation
(Solid Line) and in the Multiple-Choice Situation (Dashed Line), in the Case
Where Only Two Alternatives Are Given; Also the Operating Characteristic
of the Other Alternative in the Free-Response Situation (Solid Line) Is
Plotted from the Ceiling; $c_g = c_{k_g} = 0.5$.
FIGURE 7-2


For the purpose of illustration, Figure 7-2 presents a simple example in which only two alternatives,
the correct answer and one incorrect response, are given. In this example, $P_{k_g}(\theta)$ for the wrong answer
is drawn from the ceiling in order to make the picture visibly understandable. A normal ogive function
given by

(7.16)  $P_g(\theta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a_g(\theta - b_g)} \exp\{-u^2/2\}\, du$

with $a_g = 1/1.48$ and $b_g = 0.36$ is used as the operating characteristic of the correct answer, and the
same formula is applied for $U_{k_g}(\theta)$ and $M_{k_g}(\theta)$ for the incorrect response. The corresponding values
of the parameters are (1/1.23) and -1.84 for $M_{k_g}(\theta)$, and (1/1.51) and -0.83 for $U_{k_g}(\theta)$.
The value of $c_g$, as well as that of $c_{k_g}$ for the incorrect answer, is 0.5.
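
The two-alternative example of Figure 7-2 can be reproduced from the parameter values quoted above. The following Python sketch is a minimal illustration, not the program used for the report: it evaluates the normal ogive (7.16), builds the plausibility curve of the incorrect response from (7.3), and converts both free-response curves into multiple-choice operating characteristics through (7.14) and (7.15). All function and variable names are introduced here for illustration only.

```python
import math

C_CORRECT = 0.5      # c_g
C_DISTRACTOR = 0.5   # c_{k_g}


def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ogive(theta, a, b):
    # Normal ogive (7.16) with discrimination a and difficulty b.
    return phi(a * (theta - b))


def p_correct_free(theta):
    # Free-response operating characteristic of the correct answer.
    return ogive(theta, 1 / 1.48, 0.36)


def p_distractor_free(theta):
    # Plausibility curve (7.3): finds the plausibility, misses the irrationality.
    m = ogive(theta, 1 / 1.23, -1.84)
    u = ogive(theta, 1 / 1.51, -0.83)
    return m * (1.0 - u)


def multiple_choice(theta):
    # Formulae (7.15) and (7.14): the probability mass left over after the
    # two listed responses is shared out by random guessing.
    pg = p_correct_free(theta)
    pk = p_distractor_free(theta)
    leftover = 1.0 - (pg + pk)
    return pg + leftover * C_CORRECT, pk + leftover * C_DISTRACTOR


grid = [i / 10.0 for i in range(-40, 41)]
plaus = [p_distractor_free(t) for t in grid]
peak = grid[plaus.index(max(plaus))]
print("modal point of the plausibility curve on the grid:", peak)
for t in (-2.0, 0.0, 2.0):
    print(t, [round(v, 4) for v in multiple_choice(t)])
```

Because only two alternatives are offered and $c_g + c_{k_g} = 1$, the two multiple-choice operating characteristics returned by `multiple_choice` sum to unity at every $\theta$, while the free-response plausibility curve rises to a single mode and then declines.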

It  is  obvious  from  the  above  observations  and  discussion  that  these  are  the  fundamental  philosophies 
which  led  to  the  proposal  of  the  new  family  of  models  for  the  multiple-choice  test  item  (Samejima, 
1979b). These philosophies will provide us with the idea of content-based observation of informative
distractors  and  strategies  of  writing  test  items,  which  will  be  proposed  in  a  later  section.  The  general 
model  described  here  is  called  Informative  Distractor  Model,  in  contrast  with  the  Equivalent  Distractor 
Model,  to  which  the  three-parameter  model  represented  by  (7.6)  belongs  (cf.  Samejima,  1979b). 

[VII.2]  Effect of Noise in the Three-Parameter Logistic Model and the
Meanings of the Difficulty and Discrimination Parameters

It is still a common procedure among researchers to adopt the three-parameter logistic model, which
is represented by (3.11) in Section 3.2, for their multiple-choice test items and compare the resulting
estimated discrimination parameters, or the difficulty parameters, across different items. An important
fact that is overlooked is that this is not legitimate, for the addition of the third parameter $c_g$ makes
the other two item parameters lose their original meanings. If $a_g = 1.00$ and $c_g = 0.25$ in the
three-parameter logistic model, for example, this corresponds to $a_g = 0.75$ in the logistic model in
the maximum discrimination power. If, in addition to these parameter values, $b_g = 0.00$, then the
difficulty level for the three-parameter logistic model defined as the level of $\theta$ at which chances for
success are 0.5 is -0.4077336, i.e., substantially lower than 0.00.

In general, we can write

(7.17)  $\alpha_g = (1 - c_g)\, a_g$

        $\beta_g = b_g + (D a_g)^{-1} \log (1 - 2 c_g)$ ,

where $\alpha_g$ denotes the actual discrimination power and $\beta_g$ is the actual difficulty level in the
three-parameter logistic model. As we can see in (7.17), the effect of the third parameter $c_g$ can be
substantial, both on the discrimination power $\alpha_g$ and on the difficulty index $\beta_g$. Thus the simple
comparison of the values of $a_g$ for two or more test items having different values of the lower asymptote
$c_g$ is illegitimate and can be harmful, for the factor $(1 - c_g)$ may affect the value of $\alpha_g$, the real
discrimination power, substantially. As for the difficulty index, since the second term on the right hand
side of the second equation of (7.17) is always negative for $0 < c_g < 0.5$, this term represents the
amount of decrement of the difficulty level. Note that as $c_g$ tends to 0.5, $\beta_g$ approaches negative
infinity! (If $c_g \ge 0.5$ then $\beta_g$ does not even exist.) The illegitimacy of, and the danger in, comparing
$b_g$'s across two or more test items having different lower asymptotes $c_g$ is even more obvious for the
difficulty index.
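
As a numerical check of (7.17), the Python sketch below recomputes the example quoted earlier ($a_g = 1.00$, $b_g = 0.00$, $c_g = 0.25$). It assumes the usual scaling constant $D = 1.7$, which is not restated in this section but reproduces the value -0.4077336 given in the text. The function names are illustrative only.

```python
import math

D = 1.7  # scaling constant; assumed here, it reproduces the quoted -0.4077336


def psi(theta, a, b):
    # Logistic operating characteristic.
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def three_pl(theta, a, b, c):
    # Three-parameter logistic model, i.e. (7.6) with a logistic P_g.
    return c + (1.0 - c) * psi(theta, a, b)


def actual_discrimination(a, c):
    # First equation of (7.17).
    return (1.0 - c) * a


def actual_difficulty(a, b, c):
    # Second equation of (7.17); undefined once c_g reaches 0.5.
    if not 0.0 <= c < 0.5:
        raise ValueError("the actual difficulty level exists only for c_g < 0.5")
    return b + math.log(1.0 - 2.0 * c) / (D * a)


alpha = actual_discrimination(1.00, 0.25)
beta = actual_difficulty(1.00, 0.00, 0.25)
print(alpha, round(beta, 7))            # 0.75 -0.4077336
print(three_pl(beta, 1.00, 0.00, 0.25))  # chance of success at beta, close to 0.5
```

The second printed line confirms that the three-parameter curve passes through 0.5 at $\beta_g$ rather than at $b_g$.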

It is obvious from theory that in both the logistic and the three-parameter logistic models the
derivative of the operating characteristic of the correct answer is highest at $\theta = b_g$. Actually, the
derivatives are: $D a_g/4$ and $(1 - c_g) D a_g/4$, respectively. The ratio of this maximal slope between the
three-parameter logistic model and the logistic model is $(1 - c_g)$, which equals 0.75 when $c_g = 0.25$,
and is as low as 0.50 when $c_g = 0.50$. The corresponding ratio between the three-parameter logistic
model and the normal ogive model is approximately $0.938687718\,(1 - c_g)$, which is a little less than
$(1 - c_g)$.
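
The maximal slopes quoted above are easy to tabulate. The short Python sketch below assumes $D = 1.7$ and uses illustrative function names. Its last line reproduces the constant 0.938687718 as the ratio of the normal ogive's maximal slope, $a_g/\sqrt{2\pi}$, to the logistic maximal slope, $D a_g/4$.

```python
import math

D = 1.7  # scaling constant, assumed


def max_slope_logistic(a):
    # Slope of the logistic operating characteristic at theta = b_g.
    return D * a / 4.0


def max_slope_3pl(a, c):
    # Slope of the three-parameter logistic curve at theta = b_g.
    return (1.0 - c) * D * a / 4.0


def max_slope_normal_ogive(a):
    # Slope of the normal ogive at theta = b_g.
    return a / math.sqrt(2.0 * math.pi)


a = 1.0
for c in (0.25, 0.50):
    print(c, max_slope_3pl(a, c) / max_slope_logistic(a))
print(max_slope_normal_ogive(a) / max_slope_logistic(a))
```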

Figure  7-3  illustrates  that  several  sets  of  substantially  different  parameter  values  in  the  three- 
parameter  logistic  model  can  produce  very  similar  operating  characteristics  of  the  correct  answer.  We 
can  tell  that  the  differences  in  the  values  of  the  discrimination  and  difficulty  parameters  for  these  items 
are substantial, and yet the resulting curves are very close to each other for a wide range of $\theta$. Simple
comparison of the two estimated discrimination parameters is illegitimate, therefore, when the estimated
guessing parameters prove to be different from each other, as is usually the case with actual data. Since
the estimation of the third parameter $c_g$ tends to be most inaccurate, this example indicates the
danger in direct comparisons of the estimated discrimination parameters, and also the estimated difficulty
parameters, across the items.
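
The closeness of the curves in Figure 7-3 can be quantified with the parameter values listed in its caption. The Python sketch below, which assumes $D = 1.7$ and uses illustrative names, reports for each three-parameter item the largest absolute gap from the logistic curve with $a_g = 1.00$ and $b_g = -0.64$, separately over $0 \le \theta \le 3$ and over $-3 \le \theta \le 0$.

```python
import math

D = 1.7  # scaling constant, assumed

LOGISTIC = (1.00, -0.64)  # (a_g, b_g) of the solid line in Figure 7-3
THREE_PL_ITEMS = [        # (a_g, b_g, c_g) of the four dotted lines
    (1.05, -0.52, 0.10),
    (1.10, -0.40, 0.20),
    (1.15, -0.27, 0.30),
    (1.20, -0.13, 0.40),
]


def psi(theta, a, b):
    # Logistic operating characteristic.
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def three_pl(theta, a, b, c):
    # Three-parameter logistic operating characteristic.
    return c + (1.0 - c) * psi(theta, a, b)


def max_gap(item, lo, hi, step=0.01):
    # Largest absolute difference from the logistic curve on [lo, hi].
    n = int(round((hi - lo) / step))
    points = (lo + i * step for i in range(n + 1))
    return max(abs(three_pl(t, *item) - psi(t, *LOGISTIC)) for t in points)


for item in THREE_PL_ITEMS:
    print(item,
          round(max_gap(item, 0.0, 3.0), 4),
          round(max_gap(item, -3.0, 0.0), 4))
```

Under these assumptions the gaps on the upper range are small for all four items, whereas on the lower range they grow with $c_g$, since each three-parameter curve levels off at its own lower asymptote.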

In  most  cases  the  estimated  guessing  parameter  of  a  multiple-choice  test  item  provides  us  with  some 
other  value  than  the  reciprocal  of  the  number  of  the  alternative  answers.  It  is  reported  that  in  some  cases 
the estimated $c_g$ takes on quite high values (cf. Lord, 1980, Section 2.2). These phenomena suggest
that  the  philosophy  behind  the  model  is  unrealistic.  Researchers  using  the  three-parameter  logistic 
model  argue,  however,  that  it  still  is  a  convenient  approximation  to  real  operating  characteristics  of 


Examples of the Operating Characteristics of the Correct Answer in the Three-Parameter Logistic
Model (Dotted Lines), Together with the One in the Logistic Model with $a_g = 1.00$ and
$b_g = -0.64$ (Solid Line). The Parameters for the Four Functions in the Order of $a_g$, $b_g$
and $c_g$ are: 1.05, -0.52, 0.10; 1.10, -0.40, 0.20; 1.15, -0.27, 0.30; 1.20, -0.13, 0.40;
Respectively.
FIGURE 7-3


correct  answers,  because  of  its  simplicity  in  mathematics.  In  a  way  it  is  true.  The  effective  use  of 
the  three-parameter  model  cannot  be  realized,  however,  unless  we  know  the  problems  attributed  to 
the  model,  and  use  the  model  in  such  a  way  that  these  weaknesses  will  not  cause  too  much  noise  and 
inefficiency. 

Investigation of the problems encountered when we apply the three-parameter logistic model to
data which actually follow the normal ogive model was made earlier (Samejima, 1984b). The data
used in the study are simulated data for two samples of 500 and 2,000 hypothetical examinees,
respectively, sampled from the uniform ability distribution for the interval of $\theta$, (-2.5, 2.5). In order
to investigate the effect of the number of test items on the resultant estimated parameters obtained by
Logist 5, we used: 1) Ten Item Test and 2) Thirty-Five Item Test, both of which consist of binary items
following the normal ogive model. The response pattern for each hypothetical subject was produced
by the Monte Carlo Method. Combining these two hypothetical tests, we observed the results of: 3)
Forty-Five Item Test, and, in addition, we observed the results of rather artificially created: 4)
Eighty-Item Test (cf. Samejima, 1984b).

These  results  suggest  that  there  exists  a  substantial  effect  of  the  assumed  third  parameter,  cg  ,  on  the 
other  two  estimated  item  parameters,  if  the  estimation  is  made  by  molding  the  operating  characteristic 
of  the  correct  answer  into  that  of  the  three-parameter  logistic  model,  when  actually  it  follows  the  normal 
ogive  model.  This  effect  appears  to  be  stronger  on  the  estimated  discrimination  parameter  than  on  the 
estimated difficulty parameter. In order to correct for these enhancements, the discrimination shrinkage
factor and the difficulty reduction index were proposed (Samejima, 1984b) by formulae (7.19) and (7.21),
respectively.


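(7.18)  $a_g^* = \xi(c_g^*)\, a_g$ .

(7.19)  $\xi(c_g^*) = -\log(1 - 2c_g^*)\,[\log(1 + c_g^*) - \log(1 - c_g^*)]^{-1}$ .

(7.20)  $b_g^* = b_g + \varepsilon(c_g^* \mid a_g)$ .

(7.21)  $\varepsilon(c_g^* \mid a_g) = (D a_g)^{-1}\,[\log(1 + c_g^*) - \log(1 - c_g^*)]$ .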


In these formulae, $a_g^*$, $b_g^*$ and $c_g^*$ indicate the estimated item discrimination, difficulty and guessing
parameters when the three-parameter logistic model is assumed, respectively. Some resulting estimated
operating characteristics of the correct answer turned out to be disastrously different from the
theoretical functions, especially when only ten binary test items were included. We find no substantial
differences between the results of 500 Subject Case and 2,000 Subject Case, indicating that increasing
the number of subjects from 500 to 2,000 does not provide us with a substantial gain.
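
As a small numerical illustration of formulae (7.19) and (7.21), the sketch below evaluates the two correction terms for several values of the estimated guessing parameter. It is a minimal sketch, not the procedure of Samejima (1984b): the function names are hypothetical, the scaling constant D = 1.7 is an assumption, and the true discrimination parameter is set to 1.0 purely for illustration.

```python
import math

D = 1.7  # scaling constant; assumed value, not stated at this point of the text


def shrinkage_factor(c_star):
    """Formula (7.19): -log(1 - 2c*) / [log(1 + c*) - log(1 - c*)]."""
    if not 0.0 < c_star < 0.5:
        raise ValueError("c* must lie in (0, 0.5) so that log(1 - 2c*) is defined")
    numerator = -math.log(1.0 - 2.0 * c_star)
    denominator = math.log(1.0 + c_star) - math.log(1.0 - c_star)
    return numerator / denominator


def difficulty_shift(c_star, a, d=D):
    """Formula (7.21): (D a)^(-1) [log(1 + c*) - log(1 - c*)]."""
    return (math.log(1.0 + c_star) - math.log(1.0 - c_star)) / (d * a)


# Illustrative values only: a = 1.0 is a hypothetical true discrimination.
for c in (0.10, 0.20, 0.30, 0.40):
    print(f"c* = {c:.2f}  factor = {shrinkage_factor(c):.4f}  "
          f"shift = {difficulty_shift(c, 1.0):.4f}")
```

Both quantities grow with the estimated guessing parameter, which agrees with the observation that the distortion of the other two parameters becomes more serious as that parameter increases.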

It  has  been  pointed  out  that  the  three-parameter  logistic  model  does  not  satisfy  the  unique  maximum 
condition  for  the  likelihood  function,  and  this  topic  has  been  thoroughly  discussed  (Samejima,  1973). 
The expected loss of item information for a fixed value of $\theta$ is given by

(7.22)  $I_g(\theta) - I_g^*(\theta) = c_g D^2 a_g^2\, \Psi_g(\theta)\{1 - \Psi_g(\theta)\}\,[c_g + (1 - c_g)\Psi_g(\theta)]^{-1}$ ,

where

(7.23)  $\Psi_g(\theta) = [1 + \exp\{-D a_g(\theta - b_g)\}]^{-1}$ ,

and $I_g(\theta)$ and $I_g^*(\theta)$ are the item information functions in the logistic and the three-parameter logistic
models, respectively. We have for the critical value $\underline{\theta}_g$, below which the information provided by the
correct answer to the item following the three-parameter logistic model assumes negative values,

(7.24)  $\underline{\theta}_g = b_g + (2Da_g)^{-1}\log c_g$ ,

which is strictly increasing with the increase in the parameter value $c_g$, and also in $a_g$ and in $b_g$.
If, for example, $a_g = 1.00$ and $b_g = 0.00$, $\underline{\theta}_g = -0.473364$ for $c_g = 0.20$, and $\underline{\theta}_g = -0.407734$ for
$c_g = 0.25$. They are considerably high values relative to $b_g$.
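
The two numerical values quoted above can be reproduced with the following minimal sketch of formula (7.24). The function names are hypothetical and D = 1.7 is an assumption, although it is the value that reproduces the quoted figures. As an additional check, the sketch approximates the information provided by the correct answer, taken here as minus the second derivative of the log of its operating characteristic, by central differences, so that its change of sign around the critical value can be seen.

```python
import math

D = 1.7  # scaling constant; assumed, consistent with the values quoted in the text


def critical_theta(a, b, c, d=D):
    """Formula (7.24): b + (2 D a)^(-1) log c."""
    return b + math.log(c) / (2.0 * d * a)


def log_p_correct(theta, a, b, c, d=D):
    """Log of the three-parameter logistic operating characteristic."""
    psi = 1.0 / (1.0 + math.exp(-d * a * (theta - b)))
    return math.log(c + (1.0 - c) * psi)


def correct_answer_information(theta, a, b, c, h=1e-3):
    """Minus the second derivative of log P(correct), by central differences."""
    f_plus = log_p_correct(theta + h, a, b, c)
    f_mid = log_p_correct(theta, a, b, c)
    f_minus = log_p_correct(theta - h, a, b, c)
    return -(f_plus - 2.0 * f_mid + f_minus) / (h * h)


for c in (0.20, 0.25):
    t = critical_theta(1.0, 0.0, c)
    below = correct_answer_information(t - 0.2, 1.0, 0.0, c)
    above = correct_answer_information(t + 0.2, 1.0, 0.0, c)
    print(f"c = {c:.2f}  critical theta = {t:.6f}  "
          f"info below = {below:.4f}  info above = {above:.4f}")
```

The information is negative just below the critical value and positive just above it, which is the behavior described in the text.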

An important implication is that $\underline{\theta}_g$ is the point of $\theta$ below which the existence of a unique
maximum likelihood estimate is not assured for all the response patterns which include the correct
answer to item $g$. Although this warning has been ignored by most researchers for many years, a
recent study (Yen, Burket and Sykes, in press) points out that this is happening much more often than
people might think.

It has been pointed out (Samejima, 1979a, 1982a) that there is a certain constancy in the total
amount of item information, regardless of the parameter values and of specific functional formulae for
the operating characteristic of the correct answer. If, for example, the model belongs to Type A, i.e.,
the operating characteristic of the correct answer is monotone increasing with zero and unity as its
lower and upper asymptotes, respectively, then the total area under the curve of the square root of the
item information function will equal $\pi$. If the model belongs to Type B, i.e., the same as Type A
except that the lower asymptote of the operating characteristic of the correct answer is greater than
zero, as is the case with the three-parameter logistic model, then the total area will become

(7.25)  $\pi - 2\tan^{-1}[c_g(1 - c_g)^{-1}]^{1/2}$ ,

with the second and last term as the loss in the amount of total item information. This last term
is strictly a function of $c_g$. When $c_g = 0.20$, for example, the total amount of item information
reduces, approximately, to $0.705\pi$, and when $c_g = 0.25$ it is approximately equal to $0.667\pi$.
More observations concerning the effect of noise in the three-parameter logistic model have been made
elsewhere (Samejima, 1982b).
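
The constancy expressed by (7.25) can be checked numerically. The sketch below compares the closed form with a trapezoidal integration of the square root of the item information function of the three-parameter logistic model. It is a minimal sketch under stated assumptions: the function names, the integration limits and the step size are hypothetical choices, and D = 1.7 is assumed.

```python
import math

D = 1.7  # scaling constant; assumed


def sqrt_info_3pl(theta, a, b, c, d=D):
    """Square root of the item information function of the 3PL model."""
    psi = 1.0 / (1.0 + math.exp(-d * a * (theta - b)))
    p = c + (1.0 - c) * psi
    return d * a * psi * math.sqrt((1.0 - c) * (1.0 - psi) / p)


def area_closed_form(c):
    """Formula (7.25): pi - 2 arctan[ (c / (1 - c))^(1/2) ]."""
    return math.pi - 2.0 * math.atan(math.sqrt(c / (1.0 - c)))


def area_numeric(a, b, c, lo=-40.0, hi=40.0, n=80000):
    """Trapezoidal approximation of the total area over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (sqrt_info_3pl(lo, a, b, c) + sqrt_info_3pl(hi, a, b, c))
    for i in range(1, n):
        total += sqrt_info_3pl(lo + i * h, a, b, c)
    return total * h


for c in (0.20, 0.25):
    closed = area_closed_form(c)
    numeric = area_numeric(1.0, 0.0, c)
    print(f"c = {c:.2f}  closed form = {closed / math.pi:.3f} pi  "
          f"numeric = {numeric / math.pi:.3f} pi")
```

Changing the hypothetical values of the discrimination and difficulty parameters passed to `area_numeric` should leave the result essentially unchanged, since the area depends on the guessing parameter alone.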

As  all  the  above  observations  indicate,  the  addition  of  the  third  parameter,  cg  ,  to  the  logistic  model 
creates  many  negative  results.  We  have  seen  that  these  negative  effects  are  greater  for  larger  values  of 
$c_g$. In using the three-parameter logistic model as an approximation to real operating characteristics,
therefore,  we  need  to  take  these  facts  into  consideration.  Among  others,  if  we  are  in  a  situation  where 
we  can  modify  or  revise  our  items,  we  must  try  to  reduce  the  effect  of  noise  coming  from  cg  as  much 
as  possible.  Strategies  of  writing  the  multiple-choice  test  items  must  be  considered  accordingly. 

[VII.3] Informative Distractors of the Multiple-Choice Test Item

So  far  most  observations  and  discussion  have  been  focused  on  theory.  Applications  of  certain  non- 
parametric  methods  of  estimating  the  operating  characteristics  for  some  empirical  data  have  revealed, 
however,  that  many  multiple-choice  test  items  do  not  follow  the  three-parameter  model,  nor  do  they 
follow  the  Equivalent  Distractor  Model  in  general,  to  which  the  three-parameter  logistic  model  belongs. 
Those  items  can  best  be  interpreted  by  the  Informative  Distractor  Model. 

Figure  7-4  presents  an  example  of  the  set  of  operating  characteristics  of  the  four  alternative  answers 
to  an  item  taken  from  the  Level  11  Vocabulary  Subtest  of  the  Iowa  Tests  of  Basic  Skills  (Samejima, 
1984a), which was estimated by the Simple Sum Procedure of the Conditional P.D.F. Approach combined
with the Normal Approach Method (cf. Section 6.1). We can see in this figure that each distractor
has its own unique operating characteristic, or plausibility function, and also that the estimated
operating characteristic of the correct answer is fairly close to the one in the normal ogive model, which
is  drawn  by  a  solid  line  in  the  figure.  This  set  of  operating  characteristics  can  better  be  represented 
by  one  of  the  family  of  models  proposed  for  the  multiple-choice  test  item,  which  was  originated  by  the 
philosophy  described  in  the  preceding  section  and  takes  account  of  the  unique  information  provided  by 
each  distractor  as  well  as  the  effect  of  the  examinees’  random  guessing  behavior  (cf.  Samejima,  1979b). 
Figure  7-5  illustrates  the  operating  characteristic  of  the  correct  answer  in  Model  A.  We  can  see  that 
it  is  very  close  to  the  one  in  the  normal  ogive  model  which  is  drawn  by  a  dotted  line,  except  for  the 
lower  part  of  the  curve,  the  conditional  probability  of  success  which  is  almost  entirely  caused  by  random 
guessing.  In  cases  like  this,  it  will  be  wise  to  approximate  the  curve  by  the  normal  ogive  function  by 
discarding the item response in estimating lower ability, since it provides us with nothing but noise, as
was  discussed  in  the  preceding  section. 

Detailed  observations  for  the  plausibility  functions  of  distractors  are  made  elsewhere  (Samejima, 
1984a)  for  the  forty-three  items  of  the  Level  11  Vocabulary  Subtest  of  the  Iowa  Tests  of  Basic  Skills. 
Similar  discoveries  have  also  been  reported  with  respect  to  many  ASVAB  test  items.  In  those  results, 
it  is  clear  that  separate  wrong  answers  given  as  alternatives  provide  us  with  differential  information, 
which  can  be  useful  in  ability  estimation  in  the  sense  that  it  will  substantially  increase  the  accuracy  of 
estimation.

FIGURE 7-4 (ITEM 17)

Example  of  the  Estimated  Operating  Characteristics  of  the  Correct  Answer  (Dotted  Line) 
and  of  the  Three  Distractors  (Dashed  Lines)  Obtained  by  the  Simple  Sum  Procedure  of 
the  Conditional  P.D.F.  Approach  Combined  with  the  Normal  Approach  Method 
Together  with  the  One  for  the  Correct  Answer  Obtained  by  Assuming  the 
Normal  Ogive  Model  (Solid  Line)  Taken  from  the  Level  11  Vocabulary 
Subtest  of  the  Iowa  Tests  of  Basic  Skills. 


FIGURE 7-5

Example of the Operating Characteristic of the Correct Answer in Model A (Solid Line)
Together with One in the Normal Ogive Model (Dotted Line).

[VII.4] Merits of the Nonparametric Approach for the Identification of
Informative Distractors and for the Estimation of the Operating
Characteristics of an Item

Methods  and  approaches  developed  for  estimating  the  operating  characteristics  of  discrete  item 
responses  without  assuming  any  mathematical  form  (cf.  Section  2.3;  Samejima,  1981,  1990)  enable  us  to 
find  out  whether  or  not  a  given  incorrect  alternative  answer  to  a  multiple-choice  test  item  is  informative 
in  the  sense  that  it  contributes  to  the  increment  in  the  accuracy  in  the  estimation  of  the  individual’s 
ability. Recently, the author proposed a new approach, called the Differential Weight Procedure
of the Conditional P.D.F. Approach, which has been described in the preceding chapter. Although
more research is needed to improve the fit further, the results obtained so far are promising
for identifying informative distractors and for estimating their operating characteristics.

Item analysis has a long history, starting from the classical proportion correct and item-test
regression. In the context of latent trait models, the operating characteristics and the information functions
have provided us with powerful tools. Now we can add the plausibility functions of the distractors to
this category. By accurately identifying the configuration of the operating characteristics of the correct
answer and the distractors, we shall be able to understand the characteristics of the item, its strengths
and weaknesses. In this way modifications of the item can be made if necessary. Successful
nonparametric methods of estimating the operating characteristics are essential, therefore, for this new, more
informative approach to item analysis.

[VII.5] Efficiency in Ability Estimation and Strategies of Writing Test
Items

Observations  and  discussion  made  in  the  preceding  sections  give  us  much  useful  information  as 
well  as  warnings.  First  of  all,  theoretical  observations  indicate  that  non-monotonicity  of  the  operating 
characteristic  of  the  correct  answer  to  the  multiple-choice  test  item  is  a  natural  consequence  of  theory. 
Secondly,  it  has  been  shown  from  several  different  angles  that  the  third  parameter,  cg  ,  in  the  three- 
parameter  model  provides  us  with  nothing  but  noise;  the  greater  the  value  of  cg  the  more  noise 
and  inaccuracies  in  estimation  it  produces.  Thirdly,  it  has  been  pointed  out  that,  although  it  is  still 
a  common  procedure  for  researchers  to  mold  the  operating  characteristics  of  the  correct  answers  of 
their  multiple-choice  test  items  into  the  three-parameter  logistic  model,  some  nonparametric  methods 
applied  to  empirical  data  have  revealed  the  non-monotonicity  of  the  operating  characteristic  of  the 
correct  answer  with  many  actual  test  items,  as  well  as  differential  information  provided  by  separate 
distractors.  Fourthly,  it  has  been  pointed  out  that  the  nonparametric  approach  to  the  estimation  of  the 
operating  characteristics  of  discrete  responses  has  been  successful  enough  to  detect  the  non-monotonicity 
of  the  function  when  it  exists,  and  to  approximate  their  rather  irregular  curves  fairly  accurately. 

With  all  these  facts,  it  is  time  to  reconsider  conventional  strategies  for  item  writing  and  to  propose 
new  strategies. 

The first thing we need to reconsider is the lack of sufficient interaction between theorists and
people who write test items. It has been fairly common that: 1) a committee is organized for writing
test items in a specified content area or domain and eventually produces a set of test items; 2) another
group of people tests these items on a small sample of subjects, screens the items and then administers
the selected items to larger groups of subjects. Item calibration is done at the second stage, assuming
some model such as the three-parameter logistic model. In most cases, there is practically no
feedback from theorists to item writers. If we adopt a strategy in which the two groups of people
interact more, so that the test items are revised and pilot tested with each interaction, we
shall be able to improve the test, and the improvement will lead to efficiency in ability estimation.

FIGURE 7-6

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.5$, $b_1 = -2.0$, $b_2 = -1.0$,
$b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$.

FIGURE 7-7

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
in the Free-Response Situation Following the Logistic Model on the Graded Response
Level, with the Parameter Values: $a_g = 1.5$, $b_1 = -2.0$, $b_2 = -1.0$,
$b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$.

FIGURE 7-8

Operating Characteristics of the Correct Answer Obtained by the Five Different
Redichotomizations of the Graded Test Item Following the Logistic Model, with
the Discrimination Parameter, $a_g = 1.5$, and the Difficulty Parameters,
$b_1 = -2.0$, $b_2 = -1.0$, $b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$,
Respectively.

The second thing we need to reconsider is the simpleminded avoidance of non-monotonicity of the
operating characteristic of the correct answer. While it is not desirable for an item to have higher
conditional probabilities of the correct answer on lower levels of ability than on higher levels, selecting
alternative answers so that the dips of the operating characteristic of the correct answer be smoothed
out will lead to a substantially large value of the lower asymptote of the operating characteristic in most
cases. We must recall that even a small number like 0.2 as $c_g$ in the three-parameter logistic model
is a big nuisance, as was discussed in Section 7.2. Our strategy must be that we make the best use of
those dips, instead of avoiding them.

Figure 7-6 presents the operating characteristics of the five alternative answers of a hypothesized
test item following Model B (Samejima, 1979b), with the parameter values: $a_g = 1.50$, $b_1 = -2.00$,
$b_2 = -1.00$, $b_3 = 0.00$, $b_4 = 1.00$ and $b_5 = 2.00$. The subscript for each of the five difficulty
parameters indicates the order of easiness for the examinee to be attracted to the plausibility of each
alternative answer, so that, in this example, $b_5$ indicates the difficulty parameter of the correct
answer. We can see in this figure that a practical monotonicity exists for the operating characteristic
of the correct answer for the range of $\theta$, $(-0.5, \infty)$, and, more importantly, within this range of $\theta$
its lower asymptote is very close to zero, i.e., the nuisance caused by the non-zero lower asymptote will
be gone as long as we administer the item to populations of subjects whose ability is distributed on higher
levels than $\theta = -0.5$.

These operating characteristics of the five alternative answers in Figure 7-6 originate from those
in the logistic model on the graded response level (Samejima, 1969, 1972) with the same parameter values
(cf. Samejima, 1979b). Figure 7-7 presents the corresponding set of operating characteristics of the
correct answers in the logistic model. We notice there is an additional strictly decreasing curve in this
figure. This curve represents the conditional probability, given $\theta$, that the examinee does not find
any of the alternative answers attractive. In Model B, these people are assumed to guess randomly, so
in Figure 7-6 this curve does not exist, and the conditional probability is evenly distributed among the
five alternative answers to account for the rises in their operating characteristics at lower levels of $\theta$.
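
The relationship between Figures 7-7 and 7-6 described above can be sketched as follows. This is a minimal sketch of our reading of the text, not a reproduction of the formulae in Samejima (1979b): the category probabilities of the logistic graded response model are computed first, and the probability of finding no alternative attractive is then divided evenly among the alternatives. The function names are hypothetical and D = 1.7 is an assumption.

```python
import math

D = 1.7  # scaling constant; assumed


def logistic(theta, a, b, d=D):
    return 1.0 / (1.0 + math.exp(-d * a * (theta - b)))


def graded_probs(theta, a, bs):
    """[P_0, P_1, ..., P_m] in the logistic graded response model.

    P_0 is the probability that no alternative is found attractive;
    bs must be in increasing order.
    """
    cumulative = [1.0] + [logistic(theta, a, b) for b in bs] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(bs) + 1)]


def model_b_probs(theta, a, bs):
    """Assumed Model B form: P_0 is shared evenly among the m alternatives."""
    p = graded_probs(theta, a, bs)
    share = p[0] / len(bs)
    return [p_k + share for p_k in p[1:]]


bs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # parameter values of Figure 7-6
for theta in (-4.0, -2.0, -0.5, 1.0, 3.0):
    probs = model_b_probs(theta, 1.5, bs)
    print(f"theta = {theta:5.1f}  " + "  ".join(f"{p:.3f}" for p in probs))
```

Under this assumed form, each alternative approaches a probability of 0.2 at very low levels of ability, while the last entry, the correct answer, is close to zero around $\theta = -0.5$.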

Figure  7-8  presents  the  operating  characteristics  of  the  correct  answer  following  the  logistic  model 
on  the  dichotomous  response  level,  which  are  obtained  by  the  five  different  redichotomizations  of  the 
graded  test  item  exemplified  in  Figure  7-7.  In  these  functions,  ag  =  1.5  is  the  common  discrimination 
parameter, and the difficulty parameters are: $b_g = -2.0$, $-1.0$, $0.0$, $1.0$, $2.0$, respectively. This is the
starting  point  of  the  graded  response  model,  which  leads  to  the  operating  characteristics  illustrated  in 
Figure  7-7  (cf.  Samejima,  1969,  1972). 

Suppose that two alternative answers which attract examinees of low levels of $\theta$ are replaced,
and the revised item has $b_1 = -3.0$ and $b_2 = -1.5$, respectively. In this situation, the operating
characteristics of the correct answer obtained by the first two redichotomizations are changed. Figure
7-9 presents the set of operating characteristics for this revised test item following Model B. In this
figure we can see that the operating characteristic of the correct answer is practically strictly increasing
within the range of $\theta$, $(-1.7, \infty)$, and the pseudo lower asymptote of the operating characteristic
within this range of $\theta$ is still very close to zero.

A big gain resulting from this revision is the fact that the lower endpoint of the interval of $\theta$ in which
the operating characteristic of the correct answer is practically monotonic has substantially shifted to
the negative direction, while still keeping its lower asymptote practically zero. Thus we can avoid the
noise coming from the lower asymptote even if we administer the item to populations of examinees
whose ability distributions are located on lower levels of $\theta$. In other words, without sacrificing the
accuracy of ability estimation, the utility of the item has been substantially enhanced by this revision.

The  above  example  suggests  the  following  strategy. 

(1)  If  the  nonparametrically  estimated  operating  characteristic  of  the  correct  answer  to 
an item provides us with a relatively high value of $\theta$ below which monotonicity does
not  exist,  then  change  the  set  of  distractors  to  include  one  or  more  wrong  answers 
that  attract  examinees  of  very  low  levels  of  ability. 

It  may  sound  difficult  to  do  in  practice.  If  we  pay  attention  to  actually  used  multiple-choice  test 
items,  however,  we  will  come  across  many  wrong  alternative  answers  that  are  attracting  examinees  of 
very  low  levels  of  ability.  To  give  an  example,  the  author  has  come  across  an  arithmetic  item  asking  for 
the  area  of  a  rectangle.  A  substantial  number  of  seventh  graders  chose  the  wrong  alternative  answer 
which  equals  the  sum  of  the  two  sides  of  the  rectangle  of  different  lengths!  It  is  obvious  that  those  who 
did  not  understand  how  to  obtain  the  area  of  a  rectangle  at  all  chose  this  alternative  answer. 

Another consideration which is important in writing test items is to keep the pseudo lower asymptote
of the operating characteristic of the correct answer close enough to zero, as is the case with the above
example. This has a great deal to do with the discrimination powers of the alternative answers, as well as
the configuration of the plausibility functions. Figure 7-10 presents the set of operating characteristics
corresponding to Figure 7-6, by changing the discrimination parameter from $a_g = 1.5$ to $a_g = 1.0$,
while keeping the five difficulty parameters unchanged. If we compare Figure 7-10 with Figure 7-6, we
can see a substantial enhancement of the pseudo lower asymptote within the interval of $\theta$, $(-0.5, \infty)$,
i.e., the nuisance has been increased by the change in the discrimination parameter.
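
The comparisons among Figures 7-6, 7-9 and 7-10 can be illustrated numerically. The sketch below is a minimal one under the same reading of Model B as the text describes, in which the probability of finding no alternative attractive is shared evenly among the five alternatives; under that assumption the operating characteristic of the correct answer depends only on the lowest and the highest difficulty parameters. The function name is hypothetical and D = 1.7 is assumed.

```python
import math

D = 1.7  # scaling constant; assumed


def p_correct_model_b(theta, a, b_lowest, b_correct, m=5, d=D):
    """Correct-answer curve under the assumed Model B form.

    psi(theta; a, b_correct) is the graded-model probability of the highest
    category; [1 - psi(theta; a, b_lowest)] / m is the share of random guessing.
    """
    psi_correct = 1.0 / (1.0 + math.exp(-d * a * (theta - b_correct)))
    psi_lowest = 1.0 / (1.0 + math.exp(-d * a * (theta - b_lowest)))
    return psi_correct + (1.0 - psi_lowest) / m


original = p_correct_model_b(-0.5, 1.5, -2.0, 2.0)   # item of Figure 7-6
revised = p_correct_model_b(-1.7, 1.5, -3.0, 2.0)    # item of Figure 7-9
flatter = p_correct_model_b(-0.5, 1.0, -2.0, 2.0)    # item of Figure 7-10

print(f"Figure 7-6 item at theta = -0.5: {original:.4f}")
print(f"Figure 7-9 item at theta = -1.7: {revised:.4f}")
print(f"Figure 7-10 item at theta = -0.5: {flatter:.4f}")
```

With these assumptions the revised item is about as close to zero at $\theta = -1.7$ as the original item is at $\theta = -0.5$, whereas lowering the discrimination parameter raises the value at $\theta = -0.5$ several times over.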

This  suggests  the  second  strategy: 

(2)  If  possible,  try  to  include  distractors  whose  estimated  operating  characteristics  are 
steep,  while  keeping  the  differential  configuration  of  these  functions  as  suggested  in  (1). 

FIGURE 7-9

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.5$, $b_1 = -3.0$, $b_2 = -1.5$,

FIGURE 7-10

Operating Characteristics of the Five Alternative Answers of a Hypothetical Test Item
Following Model B, with the Parameter Values: $a_g = 1.0$, $b_1 = -2.0$, $b_2 = -1.0$,
$b_3 = 0.0$, $b_4 = 1.0$ and $b_5 = 2.0$.

So far our strategies have been focused upon producing an informative operating characteristic of
the correct answer. We notice, however, that these strategies will also yield distractors which
provide us with differential information. This implies that approximation of the nonparametrically
estimated operating characteristics of one or more alternative answers by some mathematical formulae will
enable us to use this additional differential information in ability estimation. This posterior
parameterization of the nonparametrically estimated operating characteristics of distractors will certainly lead
us to increased accuracy and efficiency in ability measurement.

[VII.6] Discussion and Conclusions

In this chapter, the shortcomings of the conventional way of handling the multiple-choice test have
been summarized, and theories and methodologies that can be applied for a better handling of the
multiple-choice test item have been described; some empirical facts have been introduced to support the
theoretical observations; finally, new strategies of item writing have been proposed which will reduce
noise and lead to more efficient ability estimation.

In spite of many controversies over the multiple-choice test, it has been, and still is, very popular
among researchers in psychological and educational measurement because of its economy in scoring.
Fortunately, theorists in mathematical psychology have developed many new ideas and methodologies in
the past couple of decades that can improve the way of handling the multiple-choice test. The nonparametric
approach to estimating the operating characteristic is one of them. Also, the rapid progress in electronic
technologies has made it possible to put these theories and methodologies into practice.
Today, we are in a position to take advantage of all these accomplishments.

VIII Efficient Computerized Adaptive Testing

In  the  previous  chapters,  various  research  findings  obtained  in  the  present  research  period  have 
been  introduced  and  discussed.  All  of  these  results  are  beneficial  for  computerized  adaptive  testing, 
especially  in  increasing  its  efficiency.  This  chapter  will  summarize  observations  as  to  how  these  findings 
and  developments  can  be  applied  in  computerized  adaptive  testing. 

[VIII.1] Validity Measures Tailoring a Sequential Subset of Items for an
Individual

The item information function, $I_g(\theta)$, has been used in computerized adaptive testing in
selecting an optimal item to tailor a sequential subtest of items for an individual examinee out of the
prearranged itempool. A procedure may be to let the computer choose an item having the highest value
of $I_g(\theta)$ at the current estimated value of $\theta$ for the individual examinee, which is based upon his
responses to the items that have already been presented to him in sequence, out of the set of remaining
items in the itempool.
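
The selection procedure just described can be sketched in a few lines. This is a minimal sketch, not the program used in the present research: it assumes dichotomous items following the logistic model, for which the item information function is $D^2 a_g^2 \Psi_g(\theta)\{1 - \Psi_g(\theta)\}$, and the itempool, item labels and function names are all hypothetical.

```python
import math

D = 1.7  # scaling constant; assumed


def item_information(theta, a, b, d=D):
    """Item information function of a dichotomous logistic item."""
    psi = 1.0 / (1.0 + math.exp(-d * a * (theta - b)))
    return (d * a) ** 2 * psi * (1.0 - psi)


def select_next_item(theta_hat, pool, administered):
    """Return the label of the remaining item with the greatest information
    at the current ability estimate, or None if the pool is exhausted.

    pool maps an item label to its (a, b) parameter pair.
    """
    remaining = [g for g in pool if g not in administered]
    if not remaining:
        return None
    return max(remaining, key=lambda g: item_information(theta_hat, *pool[g]))


# Hypothetical three-item pool, for illustration only.
pool = {"g1": (1.0, -1.0), "g2": (1.0, 0.0), "g3": (1.5, 0.1)}
first = select_next_item(0.0, pool, administered=set())
second = select_next_item(0.0, pool, administered={first})
print(first, second)
```

In an actual adaptive test the ability estimate would be updated after each response before the next call; here it is held fixed only to keep the sketch short.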

We notice from (5.6) or (5.8) in Section 5.2 that this procedure is also supported from the standpoint
of maximizing the criterion-oriented validity, for the item which provides us with the greatest item
information $I_g(\theta)$ among all the available items in the itempool also gives the greatest values of $I_g^*(\zeta)$
and its square root, at any fixed value of $\theta$.

[VIII.2] Use of the Modifications of the Test Information Function in Stopping
Rules

It is a big advantage of the modern mental test theory over classical mental test theory that the
standard error of estimation can locally be defined by means of $[I(\theta)]^{-1/2}$, which does not depend upon
the population of examinees, but is solely a property of the test itself. Using this characteristic, it has
been observed (Samejima, 1977) that in computerized adaptive testing the amount of test information
can be used effectively in the stopping rule indicating, locally, the desirable accuracy of estimation of the
examinee's ability, provided that our itempool contains a large number of items whose difficulty levels
are distributed widely over the range of $\theta$ of interest. A procedure may be to terminate the presentation
of a new item out of the itempool to the individual examinee when $I(\theta)$ has reached an a priori set
amount at the current value of his estimated $\theta$.
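
A stopping rule of this kind can be sketched as below. It is a minimal sketch under simplifying assumptions: the items are dichotomous logistic items, the ability estimate is held fixed instead of being updated after each response, and the itempool, the criterion amount of 4.0 and the function names are hypothetical. D = 1.7 is assumed.

```python
import math

D = 1.7  # scaling constant; assumed


def item_information(theta, a, b, d=D):
    psi = 1.0 / (1.0 + math.exp(-d * a * (theta - b)))
    return (d * a) ** 2 * psi * (1.0 - psi)


def administer_until(theta_hat, pool, target_info):
    """Present the most informative remaining item until the accumulated
    test information reaches target_info or the pool is exhausted.

    Returns (number of items used, accumulated information, criterion met).
    """
    remaining = list(pool)
    total, used = 0.0, 0
    while remaining and total < target_info:
        best = max(remaining, key=lambda ab: item_information(theta_hat, *ab))
        remaining.remove(best)
        total += item_information(theta_hat, *best)
        used += 1
    return used, total, total >= target_info


# Hypothetical pool: 17 items with a = 1.0 and b = -2.0, -1.75, ..., 2.0.
pool = [(1.0, -2.0 + 0.25 * k) for k in range(17)]
for theta_hat in (0.0, 2.0):
    used, total, met = administer_until(theta_hat, pool, target_info=4.0)
    print(f"theta = {theta_hat:4.1f}  items = {used:2d}  information = {total:.3f}  "
          f"standard error = {total ** -0.5:.3f}  criterion met = {met}")
```

With this hypothetical pool the criterion is met with a handful of items in the middle of the ability range, but the pool is exhausted before it is met at the upper end, which illustrates why the proviso about a large and widely spread itempool matters.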

We notice that, in general, for the stopping rule in computerized adaptive testing the modified
test information functions, $\Upsilon(\theta)$ and $\Xi(\theta)$, will serve better than the original $I(\theta)$, for in many
practical situations our itempool is more or less limited. In particular, it is usual that there are not
so many optimal items for examinees whose ability levels are close to the upper or the lower end of
the configuration of the difficulty parameters of the items in the itempool. In such a case, even if the
amount of test information has reached a certain criterion level, it does not mean that their ability
levels are estimated with the same accuracy as those of individuals of intermediate ability levels, as was
pointed out in Chapter 3. Since, taking the MLE bias function into consideration, the two modified
test information functions, $\Upsilon(\theta)$ and $\Xi(\theta)$, are based upon a more meaningful minimum bound of the
conditional variance and upon a minimum bound of the mean squared error of the maximum likelihood
estimator, respectively, they will be effectively used as the replacement of $I(\theta)$ in stopping rules of
computerized adaptive testing.

The test information function $I(\theta)$ and its two modification formulae, $\Upsilon(\theta)$ and $\Xi(\theta)$, are likely
to be the ones exemplified in the lower graph of Figure 3-5 for an individual examinee in the process
of adaptive testing, provided that the program for the test is written well. We should expect visible
differences between the results obtained by using $I(\theta)$ and by using one of its modification formulae,
therefore, especially for subjects whose ability levels are close to the upper or lower end of the ability
interval of interest. It is expected that these individuals will be required to take more test items in
order to make the accuracy of the estimation of $\theta$ comparable to that of examinees of intermediate
ability levels: a fact that could not have been disclosed without $\Upsilon(\theta)$ and $\Xi(\theta)$.

We  need  to  investigate  this  topic  in  the  future,  specifying  the  amount  of  improvement  with  simulated 
and  empirical  data  collected  in  computerized  adaptive  testing. 

[VIII.3] Use of Test Validity Measures in Stopping Rules

When we have a specific criterion variable $\gamma$ in mind, it is justified to use an a priori set value of
$I^*(\zeta)$ instead of $I(\theta)$ in the stopping rule of computerized adaptive testing. In so doing, we can obtain
the value of $I(\theta)$ corresponding to the a priori set value of $I^*(\zeta)$ for each $\theta$, through the formula

(8.1)  $I(\theta) = I^*(\zeta)\,[\partial\zeta/\partial\theta]^2$ ,

which is obtained from (5.9) in Section 5.2. Thus it is easy to have the computer handle this situation,
provided that we know the functional formula for $\zeta(\theta)$.

We notice that the test validity measures proposed in the present research (cf. Chapter 5) can be
modified, if we replace the test information function $I(\theta)$ by one of its modification formulae, $\Upsilon(\theta)$ and
$\Xi(\theta)$, which have also been proposed in the present research (cf. Chapter 3). This will be pursued in
the future, when the characteristics of these two modified test information functions have been further
investigated and clarified. It is quite possible that the new test validity measures can effectively be used in
stopping rules of computerized adaptive testing.

[VIII.4] Prediction of the Reliability Coefficient for a Specific Population
of Examinees in Computerized Adaptive Testing

It  has  also  been  observed  (Samejima,  1977)  that  in  computerized  adaptive  testing  we  can  predict 
the  reliability  coefficient  if  a  specified  amount  of  test  information  is  used  for  the  stopping  rule  for  a 
given  level  of  ability  in  each  of  the  test  and  retest  situations,  provided  that  the  two  conditions  1)  and 
2)  described  in  Section  4.2  are  met.  In  such  a  case,  we  can  write 

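(8.2)  $\mathrm{Corr.}(\hat{\theta}_1, \hat{\theta}_2) = \left[ \mathrm{Var.}(\hat{\theta}_1) - E[\{I_{(1)}(\theta)\}^{-1}] \right] \left[ \mathrm{Var.}(\hat{\theta}_1) \left\{ \mathrm{Var.}(\hat{\theta}_1) - E[\{I_{(1)}(\theta)\}^{-1}] + E[\{I_{(2)}(\theta)\}^{-1}] \right\} \right]^{-1/2}$ ,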

where $I_{(1)}(\theta)$ and $I_{(2)}(\theta)$ are the preset criterion test information functions in the test and retest
situations,  respectively,  which  are  adopted  as  the  stopping  rules  for  the  two  separate  situations.  Note 
that  these  two  criterion  test  information  functions  need  not  be  the  same,  and  also  that  the  reliability 
coefficient  is  obtainable  from  a  single  administration.  In  a  simplified  case  where,  in  each  situation,  the 
same  amount  of  test  information  is  used  as  the  criterion  for  terminating  the  presentation  of  new  items 
for  every  examinee,  we  can  rewrite  the  above  formula  into  the  form 

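(8.3)  $\mathrm{Corr.}(\hat{\theta}_1, \hat{\theta}_2) = \left[ \mathrm{Var.}(\hat{\theta}_1) - \sigma_1^2 \right] \left[ \mathrm{Var.}(\hat{\theta}_1) \left\{ \mathrm{Var.}(\hat{\theta}_1) - \sigma_1^2 + \sigma_2^2 \right\} \right]^{-1/2}$ ,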

where $\sigma_1^2$ and $\sigma_2^2$ are the reciprocals of the constant amounts of criterion test information in the
two  separate  situations,  respectively.  If  we  use  the  same  constant  amount  of  test  information  as  the 
stopping  rule  in  both  the  test  and  retest  situations,  then  the  reliability  coefficient  takes  the  simplest 
form 

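(8.4)  $\mathrm{Corr.}(\hat{\theta}_1, \hat{\theta}_2) = \left[ \mathrm{Var.}(\hat{\theta}_1) - \sigma^2 \right] \left[ \mathrm{Var.}(\hat{\theta}_1) \right]^{-1}$ ,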

where $\sigma^2$ denotes the reciprocal of this common constant amount of test information.
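The prediction in (8.2) through (8.4) can be sketched in a few lines of Python. The sketch assumes that the variance of the ability estimate in the test situation and the expected reciprocals of the criterion test information are already available; the function names and the sample-mean approximation of the expectation are illustrative assumptions only.

```python
import math


def expected_error_variance(information_values):
    """Approximate E[{I(theta)}^-1] by a sample mean.

    information_values holds the criterion test information at a sample of
    ability levels from the population of interest (an assumed input).
    """
    reciprocals = [1.0 / info for info in information_values]
    return sum(reciprocals) / len(reciprocals)


def predicted_reliability(var_theta_hat_1, error_var_1, error_var_2=None):
    """Predicted test-retest correlation of the ability estimates.

    var_theta_hat_1: variance of the ability estimate in the test situation.
    error_var_1, error_var_2: expected reciprocal criterion information in
    the test and retest situations; if error_var_2 is omitted, the same
    value is used for both, which reduces the formula to the form of (8.4).
    """
    if error_var_2 is None:
        error_var_2 = error_var_1
    true_var = var_theta_hat_1 - error_var_1      # variance of ability itself
    var_theta_hat_2 = true_var + error_var_2      # implied retest variance
    return true_var / math.sqrt(var_theta_hat_1 * var_theta_hat_2)
```

For instance, with a variance of 1.25 for the ability estimate and a common constant criterion test information of 4 (so that the reciprocal is 0.25), `predicted_reliability(1.25, 0.25)` gives 0.8.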

Also in computerized adaptive testing, either $\Upsilon(\theta)$ or $\Xi(\theta)$ can be used as the stopping rule in
place of the test information function $I(\theta)$, and we can revise (8.2) into the forms

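(8.5)  $\mathrm{Corr.}(\hat{\theta}_1, \hat{\theta}_2) = \left[ \mathrm{Var.}(\hat{\theta}_1) - E[\{\Upsilon_{(1)}(\theta)\}^{-1}] \right] \left[ \mathrm{Var.}(\hat{\theta}_1) \left\{ \mathrm{Var.}(\hat{\theta}_1) - E[\{\Upsilon_{(1)}(\theta)\}^{-1}] + E[\{\Upsilon_{(2)}(\theta)\}^{-1}] \right\} \right]^{-1/2}$ ,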

and 

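(8.6)  $\mathrm{Corr.}(\hat{\theta}_1, \hat{\theta}_2) = \left[ \mathrm{Var.}(\hat{\theta}_1) - E[\{\Xi_{(1)}(\theta)\}^{-1}] \right] \left[ \mathrm{Var.}(\hat{\theta}_1) \left\{ \mathrm{Var.}(\hat{\theta}_1) - E[\{\Xi_{(1)}(\theta)\}^{-1}] + E[\{\Xi_{(2)}(\theta)\}^{-1}] \right\} \right]^{-1/2}$ ,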

where  the  subscripts  1  and  2  represent  the  test  and  retest  situations,  respectively. 

[VIII.5]  Differential Weight Procedure for Item Analysis and for On-Line
Item Calibration

Item analysis in the true sense of the word starts from the accurate estimation of
the operating characteristics of the item responses. Thus the nonparametric estimation of the operating
characteristic offers a great deal of information about an item, when it is successful. In this sense we
can say that the Differential Weight Procedure of the Conditional P.D.F. Approach (cf. Chapter 6)
holds promise for successful item analysis in general.

For success in adaptive testing, it is essential to create a good initial itempool. The Differential
Weight Procedure, applied repeatedly in pilot studies, can be used effectively in selecting appropriate
test items for the itempool.

The Differential Weight Procedure will be especially useful for on-line item calibration in computerized
adaptive testing. When we use an adaptive test, it is necessary to discard certain test items from
our itempool after they have been administered too frequently, or too seldom, and to replace them with new
test items. In so doing, we need to calibrate these new test items on-line, and successful nonparametric
estimation methods adjusted to this situation will be most valuable for discovering the operating
characteristics of these new test items.

Many computer programs have been written in the present research in order to implement this new
method and to put the theory and methodologies into practice. In developing this method further,
the focus of research will be on methodologies for estimating differential weight functions under
different circumstances. It should also be noted that we need to develop efficient computer programs
for smoothing out the irregularities of the differential weight function whenever this is needed.

Once the operating characteristics of the test items have been discovered, however, it will be wise
to search for appropriate mathematical forms in order to simplify them by parameterization.
In so doing, the observations and mathematical models introduced in Chapter 7 will be useful,
especially in dealing with non-monotonic operating characteristics or those which are strictly increasing
but converge to some values less than unity.

[VIII.6]  Use of Informative Distractors

One of the future directions of computerized adaptive testing will be the use of information
coming from the distractors of the multiple-choice test item, as well as from the correct answer. This
will certainly increase the item information both locally and in total, and, as a result, the estimation
of the individual examinee's ability will become more efficient.

For this reason, an accurate estimation of the plausibility functions of the distractors of
multiple-choice test items becomes very important for the future of computerized adaptive testing. In this
context, again, the Differential Weight Procedure of the Conditional P.D.F. Approach will play an important
role, for it can be used for estimating the operating characteristics not only of correct answers but of
any discrete item responses, including the distractors of multiple-choice test items.

The content-based observation of informative distractors, which has been described in Chapter
7, will also become useful and important. The suggested strategies for writing test items (cf. Section 7.5) can
readily be adopted in the construction of itempools as well as in on-line item calibration in future
research.

[VIII.7]  Discussion and Conclusions

The above sections have summarized the research accomplishments which will directly contribute
to computerized adaptive testing. Since each accomplishment has been observed and discussed in
detail in the previous chapters, this chapter has been kept brief.

Efficient computerized adaptive testing is one of the main objectives of the present research. The
author has been pleased to introduce these accomplishments, which will benefit it from various angles.

IX  Other Findings in the Present Research

There are many other findings in the present research which have not been reported in the
ONR research reports. They concern topics that are still being pursued, or that will find their
places in a more comprehensive framework in future research.

Among these findings are those on the winsorization of the outliers of the maximum likelihood
estimates of $\theta$, adopted in the process of the Simple Sum Procedure of the Conditional P.D.F. Approach
for estimating the operating characteristics of discrete item responses. The results turned out to be
fairly successful. We still need further research on this subject, however, before we can evaluate this
variation of the Simple Sum Procedure.
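For readers unfamiliar with winsorization, the following Python fragment gives a generic sketch: estimates beyond chosen lower and upper order statistics are pulled in to those values rather than discarded. The cut-off fractions and the function name are illustrative assumptions; the report does not state which limits were used with the Simple Sum Procedure.

```python
def winsorize(estimates, lower_fraction=0.05, upper_fraction=0.05):
    """Clamp extreme estimates to inner order statistics.

    The lowest lower_fraction and the highest upper_fraction of the values
    are replaced by the nearest retained value. The default fractions are
    hypothetical choices, not taken from the report.
    """
    n = len(estimates)
    if n == 0:
        raise ValueError("estimates must not be empty")
    if lower_fraction < 0 or upper_fraction < 0 or lower_fraction + upper_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    ordered = sorted(estimates)
    low_limit = ordered[int(lower_fraction * n)]
    high_limit = ordered[n - 1 - int(upper_fraction * n)]
    # Values inside the limits are unchanged; outliers are moved to the limits.
    return [min(max(value, low_limit), high_limit) for value in estimates]
```

Unlike trimming, this keeps the number of examinees unchanged, which matters when the estimates are subsequently used to estimate operating characteristics.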

Some  considerations  and  observations  have  also  been  made  concerning  possible  applications  of  the 
theories  and  methodologies  developed  so  far  in  the  area  of  latent  trait  models.  They  include  the  latent 
trait  approach  to  Rorschach  diagnosis  based  upon  the  Burstein-Loucks  scoring  system,  and  the  prospect 
of  applying  latent  trait  models  and  methodologies  accommodating  both  psychological  and  neurological 
factors  (cf.  Chapter  1). 